Merge branch '3.0'

This commit is contained in:
Jaiden Mispy 2014-12-05 22:57:41 +11:00
commit 56aadea555
20 changed files with 738 additions and 15203 deletions

.gitignore

@@ -1,3 +1,5 @@
.*.swp
Gemfile.lock
pkg
.yardoc
doc

README.md
@@ -4,8 +4,16 @@
[![Build Status](https://travis-ci.org/mispy/twitter_ebooks.svg)](https://travis-ci.org/mispy/twitter_ebooks)
[![Dependency Status](https://gemnasium.com/mispy/twitter_ebooks.svg)](https://gemnasium.com/mispy/twitter_ebooks)

A framework for building interactive twitterbots which respond to mentions/DMs. twitter_ebooks tries to be a good friendly bot citizen by avoiding infinite conversations and spamming people, so you only have to write the interesting parts.

## New in 3.0
- Bots run in their own threads (no eventmachine), and startup is parallelized
- Bots start with `ebooks start`, and no longer die on unhandled exceptions
- `ebooks auth` command will create new access tokens, for running multiple bots
- `ebooks console` starts a ruby interpreter with bots loaded (see Ebooks::Bot.all)
- Replies are slightly rate-limited to prevent infinite bot convos
- Non-participating users in a mention chain will be dropped after a few tweets
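The mention-dropping behavior in that last point can be illustrated with a small self-contained sketch (class and method names here are illustrative, not the gem's API; in the gem this logic lives in `Ebooks::Conversation`): a user stays in the reply prefix only while the conversation is still short or they have tweeted recently in it.

```ruby
# Simplified sketch of the "drop non-participating users" heuristic.
# Each conversation tracks its tweets; a user is kept in the reply
# chain only if the thread is short or they spoke in the last four tweets.
class ConversationSketch
  def initialize
    @tweets = [] # each entry: { user: String }
  end

  def add(username)
    @tweets << { user: username }
  end

  # Keep a user in the reply prefix if the conversation is still short,
  # or if they appear among the last four tweets.
  def can_include?(username)
    @tweets.length <= 4 ||
      @tweets[-4..-1].any? { |t| t[:user] == username }
  end
end

conv = ConversationSketch.new
%w[alice bot alice bot carol bot].each { |u| conv.add(u) }
puts conv.can_include?("alice") # => true
```

Here `alice` remains includable because she appears in the last four tweets, while a user who has gone quiet in a long thread would be dropped from subsequent replies.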
## Installation
@@ -21,53 +29,63 @@ Run `ebooks new <reponame>` to generate a new repository containing a sample bot
``` ruby
# This is an example bot definition with event handlers commented out
# You can define and instantiate as many bots as you like

class MyBot < Ebooks::Bot
  # Configuration here applies to all MyBots
  def configure
    # Consumer details come from registering an app at https://dev.twitter.com/
    # Once you have consumer details, use "ebooks auth" for new access tokens
    self.consumer_key = '' # Your app consumer key
    self.consumer_secret = '' # Your app consumer secret

    # Users to block instead of interacting with
    self.blacklist = ['tnietzschequote']

    # Range in seconds to randomize delay when bot.delay is called
    self.delay_range = 1..6
  end

  def on_startup
    scheduler.every '24h' do
      # Tweet something every 24 hours
      # See https://github.com/jmettraux/rufus-scheduler
      # bot.tweet("hi")
      # bot.pictweet("hi", "cuteselfie.jpg")
    end
  end

  def on_message(dm)
    # Reply to a DM
    # bot.reply(dm, "secret secrets")
  end

  def on_follow(user)
    # Follow a user back
    # bot.follow(user[:screen_name])
  end

  def on_mention(tweet)
    # Reply to a mention
    # bot.reply(tweet, meta(tweet)[:reply_prefix] + "oh hullo")
  end

  def on_timeline(tweet)
    # Reply to a tweet in the bot's timeline
    # bot.reply(tweet, meta(tweet)[:reply_prefix] + "nice tweet")
  end
end

# Make a MyBot and attach it to an account
MyBot.new("{{BOT_NAME}}") do |bot|
  bot.access_token = "" # Token connecting the app to this account
  bot.access_token_secret = "" # Secret connecting the app to this account
end
```
`ebooks start` will run all defined bots in their own threads. The easiest way to run bots in a semi-permanent fashion is with [Heroku](https://www.heroku.com); just make an app, push the bot repository to it, enable a worker process in the web interface and it ought to chug along merrily forever.
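On Heroku, the worker process is declared in the repository's Procfile; a minimal sketch (assuming the skeleton's layout and that the `ebooks` binary is available through Bundler — check your generated repository before relying on this) might look like:

```
worker: ebooks start
```

Scaling the worker to one dyno (`heroku ps:scale worker=1`) then keeps the bots running.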
The underlying streaming and REST clients from the [twitter gem](https://github.com/sferik/twitter) can be accessed at `bot.stream` and `bot.twitter` respectively.
## Archiving accounts
@@ -102,7 +120,6 @@ Text files use newlines and full stops to separate statements.
Once you have a model, the primary use is to produce statements and related responses to input, using a pseudo-Markov generator:
``` ruby
> model = Ebooks::Model.load("model/0xabad1dea.model")
> model.make_statement(140)
=> "My Terrible Netbook may be the kind of person who buys Starbucks, but this Rackspace vuln is pretty straight up a backdoor"
```
@@ -113,14 +130,18 @@ Once you have a model, the primary use is to produce statements and related responses
The secondary function is the "interesting keywords" list. For example, I use this to determine whether a bot wants to fav/retweet/reply to something in its timeline:
``` ruby
top100 = model.keywords.take(100)
tokens = Ebooks::NLP.tokenize(tweet[:text])

if tokens.find { |t| top100.include?(t) }
  bot.favorite(tweet[:id])
end
```
## Bot niceness
## Other notes
If you're using Heroku, which has no persistent filesystem, automating the process of archiving, consuming and updating can be tricky. My current solution is just a daily cron job which commits and pushes for me, which is pretty hacky.
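That daily job could be sketched as a crontab entry along these lines (the path, bot name, and git remote are placeholders, not part of the gem):

```
# Refresh the corpus and model each morning, then push so Heroku redeploys
0 4 * * * cd /path/to/mybot && ebooks archive mybot corpus/mybot.json && ebooks consume corpus/mybot.json && git commit -am "daily corpus update" && git push heroku master
```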

bin/ebooks

@@ -2,54 +2,85 @@
# encoding: utf-8
require 'twitter_ebooks'
require 'ostruct'

module Ebooks::Util
  def pretty_exception(e)
  end
end

module Ebooks::CLI
  APP_PATH = Dir.pwd # XXX do some recursive thing instead

  HELP = OpenStruct.new
  HELP.default = <<STR
Usage:
  ebooks help <command>

  ebooks new <reponame>
  ebooks auth
  ebooks consume <corpus_path> [corpus_path2] [...]
  ebooks consume-all <name> <corpus_path> [corpus_path2] [...]
  ebooks gen <model_path> [input]
  ebooks archive <username> [path]
  ebooks tweet <model_path> <botname>
STR

  def self.help(command=nil)
    if command.nil?
      log HELP.default
    else
      log HELP[command].gsub(/^ {4}/, '')
    end
  end

  HELP.new = <<-STR
    Usage: ebooks new <reponame>

    Creates a new skeleton repository defining a template bot in
    the current working directory specified by <reponame>.
  STR

  def self.new(reponame)
    if reponame.nil?
      help :new
      exit 1
    end

    path = "./#{reponame}"

    if File.exists?(path)
      log "#{path} already exists. Please remove if you want to recreate."
      exit 1
    end

    FileUtils.cp_r(Ebooks::SKELETON_PATH, path)

    File.open(File.join(path, 'bots.rb'), 'w') do |f|
      template = File.read(File.join(Ebooks::SKELETON_PATH, 'bots.rb'))
      f.write(template.gsub("{{BOT_NAME}}", reponame))
    end

    File.open(File.join(path, 'Gemfile'), 'w') do |f|
      template = File.read(File.join(Ebooks::SKELETON_PATH, 'Gemfile'))
      f.write(template.gsub("{{RUBY_VERSION}}", RUBY_VERSION))
    end

    log "New twitter_ebooks app created at #{reponame}"
  end

  HELP.consume = <<-STR
    Usage: ebooks consume <corpus_path> [corpus_path2] [...]

    Processes some number of text files or json tweet corpuses
    into usable models. These will be output at model/<name>.model
  STR

  def self.consume(pathes)
    if pathes.empty?
      help :consume
      exit 1
    end

    pathes.each do |path|
@@ -57,50 +88,43 @@
      shortname = filename.split('.')[0..-2].join('.')

      outpath = File.join(APP_PATH, 'model', "#{shortname}.model")
      Ebooks::Model.consume(path).save(outpath)
      log "Corpus consumed to #{outpath}"
    end
  end

  HELP.consume_all = <<-STR
    Usage: ebooks consume-all <name> <corpus_path> [corpus_path2] [...]

    Processes some number of text files or json tweet corpuses
    into one usable model. It will be output at model/<name>.model
  STR

  def self.consume_all(name, paths)
    if paths.empty?
      help :consume_all
      exit 1
    end

    outpath = File.join(APP_PATH, 'model', "#{name}.model")
    Ebooks::Model.consume_all(paths).save(outpath)
    log "Corpuses consumed to #{outpath}"
  end

  HELP.gen = <<-STR
    Usage: ebooks gen <model_path> [input]

    Make a test tweet from the processed model at <model_path>.
    Will respond to input if provided.
  STR

  def self.gen(model_path, input)
    if model_path.nil?
      help :gen
      exit 1
    end

    model = Ebooks::Model.load(model_path)
    if input && !input.empty?
      puts "@cmd " + model.make_response(input, 135)
    else
@@ -108,81 +132,186 @@
    end
  end

  HELP.archive = <<-STR
    Usage: ebooks archive <username> [outpath]

    Downloads a json corpus of the <username>'s tweets.
    Output defaults to corpus/<username>.json
    Due to API limitations, this can only receive up to ~3000 tweets
    into the past.
  STR

  def self.archive(username, outpath=nil)
    if username.nil?
      help :archive
      exit 1
    end

    Ebooks::Archive.new(username, outpath).sync
  end

  HELP.tweet = <<-STR
    Usage: ebooks tweet <model_path> <botname>

    Sends a public tweet from the specified bot using text
    from the processed model at <model_path>.
  STR

  def self.tweet(modelpath, botname)
    if modelpath.nil? || botname.nil?
      help :tweet
      exit 1
    end

    load File.join(APP_PATH, 'bots.rb')
    model = Ebooks::Model.load(modelpath)
    statement = model.make_statement
    bot = Ebooks::Bot.get(botname)
    bot.configure
    bot.tweet(statement)
  end

  HELP.auth = <<-STR
    Usage: ebooks auth

    Authenticates your Twitter app for any account. By default, will
    use the consumer key and secret from the first defined bot. You
    can specify another by setting the CONSUMER_KEY and CONSUMER_SECRET
    environment variables.
  STR

  def self.auth
    consumer_key, consumer_secret = find_consumer
    require 'oauth'

    consumer = OAuth::Consumer.new(
      consumer_key,
      consumer_secret,
      site: 'https://twitter.com/',
      scheme: :header
    )

    request_token = consumer.get_request_token
    auth_url = request_token.authorize_url()

    pin = nil
    loop do
      log auth_url
      log "Go to the above url and follow the prompts, then enter the PIN code here."
      print "> "

      pin = STDIN.gets.chomp
      break unless pin.empty?
    end

    access_token = request_token.get_access_token(oauth_verifier: pin)

    log "Account authorized successfully. Make sure to put these in your bots.rb!\n" +
        "  access token: #{access_token.token}\n" +
        "  access token secret: #{access_token.secret}"
  end

  HELP.console = <<-STR
    Usage: ebooks c[onsole]

    Starts an interactive ruby session with your bots loaded
    and configured.
  STR

  def self.console
    load_bots
    require 'pry'; Ebooks.module_exec { pry }
  end

  HELP.start = <<-STR
    Usage: ebooks s[tart] [botname]

    Starts running bots. If botname is provided, only runs that bot.
  STR

  def self.start(botname=nil)
    load_bots

    if botname.nil?
      bots = Ebooks::Bot.all
    else
      bots = Ebooks::Bot.all.select { |bot| bot.username == botname }
      if bots.empty?
        log "Couldn't find a defined bot for @#{botname}!"
        exit 1
      end
    end

    threads = []
    bots.each do |bot|
      threads << Thread.new { bot.prepare }
    end
    threads.each(&:join)

    threads = []
    bots.each do |bot|
      threads << Thread.new do
        loop do
          begin
            bot.start
          rescue Exception => e
            bot.log e.inspect
            puts e.backtrace.map { |s| "\t" + s }.join("\n")
          end
          bot.log "Sleeping before reconnect"
          sleep 5
        end
      end
    end
    threads.each(&:join)
  end

  # Non-command methods

  def self.find_consumer
    if ENV['CONSUMER_KEY'] && ENV['CONSUMER_SECRET']
      log "Using consumer details from environment variables:\n" +
          "  consumer key: #{ENV['CONSUMER_KEY']}\n" +
          "  consumer secret: #{ENV['CONSUMER_SECRET']}"
      return [ENV['CONSUMER_KEY'], ENV['CONSUMER_SECRET']]
    end

    load_bots

    Ebooks::Bot.all.each do |bot|
      if bot.consumer_key && bot.consumer_secret
        log "Using consumer details from @#{bot.username}:\n" +
            "  consumer key: #{bot.consumer_key}\n" +
            "  consumer secret: #{bot.consumer_secret}\n"
        return [bot.consumer_key, bot.consumer_secret]
      end
    end

    log "Couldn't find any consumer details to auth an account with.\n" +
        "Please either configure a bot with consumer_key and consumer_secret\n" +
        "or provide the CONSUMER_KEY and CONSUMER_SECRET environment variables."
    exit 1
  end

  def self.load_bots
    load 'bots.rb'

    if Ebooks::Bot.all.empty?
      puts "Couldn't find any bots! Please make sure bots.rb instantiates at least one bot."
    end
  end

  def self.command(args)
    if args.length == 0
      help
      exit 1
    end

    case args[0]
@@ -190,16 +319,21 @@
    when "consume" then consume(args[1..-1])
    when "consume-all" then consume_all(args[1], args[2..-1])
    when "gen" then gen(args[1], args[2..-1].join(' '))
    when "archive" then archive(args[1], args[2])
    when "tweet" then tweet(args[1], args[2])
    when "jsonify" then jsonify(args[1..-1])
    when "auth" then auth
    when "console" then console
    when "c" then console
    when "start" then start(args[1])
    when "s" then start(args[1])
    when "help" then help(args[1])
    else
      log "No such command '#{args[0]}'"
      help
      exit 1
    end
  end
end

Ebooks::CLI.command(ARGV)

lib/twitter_ebooks.rb

@@ -11,11 +11,11 @@ module Ebooks
  SKELETON_PATH = File.join(GEM_PATH, 'skeleton')
  TEST_PATH = File.join(GEM_PATH, 'test')
  TEST_CORPUS_PATH = File.join(TEST_PATH, 'corpus/0xabad1dea.tweets')
  INTERIM = :interim
end

require 'twitter_ebooks/nlp'
require 'twitter_ebooks/archive'
require 'twitter_ebooks/suffix'
require 'twitter_ebooks/model'
require 'twitter_ebooks/bot'

lib/twitter_ebooks/archive.rb

@@ -39,9 +39,14 @@ module Ebooks
    end
  end

  def initialize(username, path=nil, client=nil)
    @username = username
    @path = path || "corpus/#{username}.json"

    if File.directory?(@path)
      @path = File.join(@path, "#{username}.json")
    end

    @client = client || make_client

    if File.exists?(@path)

lib/twitter_ebooks/bot.rb (executable file → normal file)

@@ -6,143 +6,91 @@ module Ebooks
  class ConfigurationError < Exception
  end

  # Represents a single reply tree of tweets
  class Conversation
    attr_reader :last_update

    # @param bot [Ebooks::Bot]
    def initialize(bot)
      @bot = bot
      @tweets = []
      @last_update = Time.now
    end

    # @param tweet [Twitter::Tweet] tweet to add
    def add(tweet)
      @tweets << tweet
      @last_update = Time.now
    end

    # Make an informed guess as to whether a user is a bot based
    # on their behavior in this conversation
    def is_bot?(username)
      usertweets = @tweets.select { |t| t.user.screen_name == username }

      if usertweets.length > 2
        if (usertweets[-1].created_at - usertweets[-3].created_at) < 30
          return true
        end
      end

      username.include?("ebooks")
    end

    # Figure out whether to keep this user in the reply prefix
    # We want to avoid spamming non-participating users
    def can_include?(username)
      @tweets.length <= 4 ||
        !@tweets[-4..-1].select { |t| t.user.screen_name == username }.empty?
    end
  end
  # Meta information about a tweet that we calculate for ourselves
  class TweetMeta
    # @return [Array<String>] usernames mentioned in tweet
    attr_accessor :mentions
    # @return [String] text of tweet with mentions removed
    attr_accessor :mentionless
    # @return [Array<String>] usernames to include in a reply
    attr_accessor :reply_mentions
    # @return [String] mentions to start reply with
    attr_accessor :reply_prefix
    # @return [Integer] available chars for reply
    attr_accessor :limit

    # @return [Ebooks::Bot] associated bot
    attr_accessor :bot
    # @return [Twitter::Tweet] associated tweet
    attr_accessor :tweet

    # Check whether this tweet mentions our bot
    # @return [Boolean]
    def mentions_bot?
      # To check if this is someone talking to us, ensure:
      # - The tweet mentions list contains our username
      # - The tweet is not being retweeted by somebody else
      # - Or soft-retweeted by somebody else
      @mentions.map(&:downcase).include?(@bot.username.downcase) && !@tweet.retweeted_status? && !@tweet.text.start_with?('RT ')
    end

    # @param bot [Ebooks::Bot]
    # @param ev [Twitter::Tweet]
    def initialize(bot, ev)
      @bot = bot
      @tweet = ev

      @mentions = ev.attrs[:entities][:user_mentions].map { |x| x[:screen_name] }

      # Process mentions to figure out who to reply to
      # i.e. not self and nobody who has seen too many secondary mentions
      reply_mentions = @mentions.reject do |m|
        username = m.downcase
        username == @bot.username || !@bot.conversation(ev).can_include?(username)
      end
      @reply_mentions = ([ev.user.screen_name] + reply_mentions).uniq

      @reply_prefix = @reply_mentions.map { |m| '@' + m }.join(' ') + ' '
      @limit = 140 - @reply_prefix.length

      mless = ev.text
      begin
@@ -155,12 +103,116 @@ module Ebooks
        p ev.text
        raise
      end
      @mentionless = mless
    end
  end
  class Bot
    # @return [String] OAuth consumer key for a Twitter app
    attr_accessor :consumer_key
    # @return [String] OAuth consumer secret for a Twitter app
    attr_accessor :consumer_secret
    # @return [String] OAuth access token from `ebooks auth`
    attr_accessor :access_token
    # @return [String] OAuth access secret from `ebooks auth`
    attr_accessor :access_token_secret
    # @return [String] Twitter username of bot
    attr_accessor :username
    # @return [Array<String>] list of usernames to block on contact
    attr_accessor :blacklist
    # @return [Hash{String => Ebooks::Conversation}] maps tweet ids to their conversation contexts
    attr_accessor :conversations
    # @return [Range, Integer] range of seconds to delay in delay method
    attr_accessor :delay_range

    # @return [Array] list of all defined bots
    def self.all; @@all ||= []; end

    # Fetches a bot by username
    # @param username [String]
    # @return [Ebooks::Bot]
    def self.get(username)
      all.find { |bot| bot.username == username }
    end

    # Logs info to stdout in the context of this bot
    def log(*args)
      STDOUT.print "@#{@username}: " + args.map(&:to_s).join(' ') + "\n"
      STDOUT.flush
    end

    # Initializes and configures bot
    # @param username [String] Twitter username of bot
    # @param b Block to call with new bot
    def initialize(username, &b)
      @blacklist ||= []
      @conversations ||= {}
      # Tweet ids we've already observed, to avoid duplication
      @seen_tweets ||= {}

      @username = username
      configure

      b.call(self) unless b.nil?
      Bot.all << self
    end

    # Find or create the conversation context for this tweet
    # @param tweet [Twitter::Tweet]
    # @return [Ebooks::Conversation]
    def conversation(tweet)
      conv = if tweet.in_reply_to_status_id?
        @conversations[tweet.in_reply_to_status_id]
      end

      if conv.nil?
        conv = @conversations[tweet.id] || Conversation.new(self)
      end

      if tweet.in_reply_to_status_id?
        @conversations[tweet.in_reply_to_status_id] = conv
      end
      @conversations[tweet.id] = conv

      # Expire any old conversations to prevent memory growth
      @conversations.each do |k, v|
        if v != conv && Time.now - v.last_update > 3600
          @conversations.delete(k)
        end
      end

      conv
    end

    # @return [Twitter::REST::Client] underlying REST client from twitter gem
    def twitter
      @twitter ||= Twitter::REST::Client.new do |config|
        config.consumer_key = @consumer_key
        config.consumer_secret = @consumer_secret
        config.access_token = @access_token
        config.access_token_secret = @access_token_secret
      end
    end

    # @return [Twitter::Streaming::Client] underlying streaming client from twitter gem
    def stream
      @stream ||= Twitter::Streaming::Client.new do |config|
        config.consumer_key = @consumer_key
        config.consumer_secret = @consumer_secret
        config.access_token = @access_token
        config.access_token_secret = @access_token_secret
      end
    end

    # Calculate some meta information about a tweet relevant for replying
    # @param ev [Twitter::Tweet]
    # @return [Ebooks::TweetMeta]
    def meta(ev)
      TweetMeta.new(self, ev)
    end
# Receive an event from the twitter stream # Receive an event from the twitter stream
# @param ev [Object] Twitter streaming event
def receive_event(ev) def receive_event(ev)
if ev.is_a? Array # Initial array sent on first connection if ev.is_a? Array # Initial array sent on first connection
log "Online!" log "Online!"
@ -181,7 +233,7 @@ module Ebooks
return unless ev.text # If it's not a text-containing tweet, ignore it return unless ev.text # If it's not a text-containing tweet, ignore it
return if ev.user.screen_name == @username # Ignore our own tweets return if ev.user.screen_name == @username # Ignore our own tweets
meta = calc_meta(ev) meta = meta(ev)
if blacklisted?(ev.user.screen_name) if blacklisted?(ev.user.screen_name)
log "Blocking blacklisted user @#{ev.user.screen_name}" log "Blocking blacklisted user @#{ev.user.screen_name}"
@ -190,17 +242,18 @@ module Ebooks
# Avoid responding to duplicate tweets # Avoid responding to duplicate tweets
if @seen_tweets[ev.id] if @seen_tweets[ev.id]
log "Not firing event for duplicate tweet #{ev.id}"
return return
else else
@seen_tweets[ev.id] = true @seen_tweets[ev.id] = true
end end
if meta[:mentions_bot] if meta.mentions_bot?
log "Mention from @#{ev.user.screen_name}: #{ev.text}" log "Mention from @#{ev.user.screen_name}: #{ev.text}"
interaction(ev.user.screen_name).receive(ev) conversation(ev).add(ev)
fire(:mention, ev, meta) fire(:mention, ev)
else else
fire(:timeline, ev, meta) fire(:timeline, ev)
end end
elsif ev.is_a?(Twitter::Streaming::DeletedTweet) || elsif ev.is_a?(Twitter::Streaming::DeletedTweet) ||
@ -211,7 +264,31 @@ module Ebooks
end end
end end
def start_stream # Configures client and fires startup event
def prepare
# Sanity check
if @username.nil?
raise ConfigurationError, "bot username cannot be nil"
end
if @consumer_key.nil? || @consumer_key.empty? ||
@consumer_secret.nil? || @consumer_key.empty?
log "Missing consumer_key or consumer_secret. These details can be acquired by registering a Twitter app at https://apps.twitter.com/"
exit 1
end
if @access_token.nil? || @access_token.empty? ||
@access_token_secret.nil? || @access_token_secret.empty?
log "Missing access_token or access_token_secret. Please run `ebooks auth`."
exit 1
end
twitter
fire(:startup)
end
# Start running user event stream
def start
log "starting tweet stream" log "starting tweet stream"
stream.user do |ev| stream.user do |ev|
@ -219,22 +296,9 @@ module Ebooks
end end
end end
def prepare
# Sanity check
if @username.nil?
raise ConfigurationError, "bot.username cannot be nil"
end
twitter
fire(:startup)
end
# Connects to tweetstream and opens event handlers for this bot
def start
start_stream
end
# Fire an event # Fire an event
# @param event [Symbol] event to fire
# @param args arguments for event handler
def fire(event, *args) def fire(event, *args)
handler = "on_#{event}".to_sym handler = "on_#{event}".to_sym
if respond_to? handler if respond_to? handler
@ -242,11 +306,17 @@ module Ebooks
end end
end end
def delay(&b) # Delay an action for a variable period of time
time = @delay.to_a.sample unless @delay.is_a? Integer # @param range [Range, Integer] range of seconds to choose for delay
def delay(range=@delay_range, &b)
time = range.is_a?(Integer) ? range : range.to_a.sample
sleep time sleep time
b.call
end end
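The delay helper above can be tried standalone. This is a simplified sketch, not the gem's exact code: an Integer argument sleeps that many seconds, a Range samples a duration from it.

```ruby
# Sketch of the delay helper: sample a pause from a Range (or use an
# Integer directly), sleep, then run the block
def delay(range = 1..6, &b)
  time = range.is_a?(Integer) ? range : range.to_a.sample
  sleep time
  b.call
end

delay(0) { puts "replying now" }
```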
# Check if a username is blacklisted
# @param username [String]
# @return [Boolean]
def blacklisted?(username) def blacklisted?(username)
if @blacklist.include?(username) if @blacklist.include?(username)
true true
@ -256,46 +326,37 @@ module Ebooks
end end
# Reply to a tweet or a DM. # Reply to a tweet or a DM.
# @param ev [Twitter::Tweet, Twitter::DirectMessage]
# @param text [String] contents of reply excluding reply_prefix
# @param opts [Hash] additional params to pass to twitter gem
def reply(ev, text, opts={}) def reply(ev, text, opts={})
opts = opts.clone opts = opts.clone
if ev.is_a? Twitter::DirectMessage if ev.is_a? Twitter::DirectMessage
return if blacklisted?(ev.sender.screen_name)
log "Sending DM to @#{ev.sender.screen_name}: #{text}" log "Sending DM to @#{ev.sender.screen_name}: #{text}"
twitter.create_direct_message(ev.sender.screen_name, text, opts) twitter.create_direct_message(ev.sender.screen_name, text, opts)
elsif ev.is_a? Twitter::Tweet elsif ev.is_a? Twitter::Tweet
meta = calc_meta(ev) meta = meta(ev)
if !interaction(ev.user.screen_name).continue? if conversation(ev).is_bot?(ev.user.screen_name)
log "Not replying to suspected bot @#{ev.user.screen_name}" log "Not replying to suspected bot @#{ev.user.screen_name}"
return return false
end end
if !meta[:mentions_bot] log "Replying to @#{ev.user.screen_name} with: #{meta.reply_prefix + text}"
if !userinfo(ev.user.screen_name).can_pester? tweet = twitter.update(meta.reply_prefix + text, in_reply_to_status_id: ev.id)
log "Not replying: leaving @#{ev.user.screen_name} alone" conversation(tweet).add(tweet)
return tweet
else
userinfo(ev.user.screen_name).pesters_left -= 1
end
end
log "Replying to @#{ev.user.screen_name} with: #{meta[:reply_prefix] + text}"
twitter.update(meta[:reply_prefix] + text, in_reply_to_status_id: ev.id)
else else
raise ArgumentError, "Don't know how to reply to a #{ev.class}" raise ArgumentError, "Don't know how to reply to a #{ev.class}"
end end
end end
# Favorite a tweet
# @param tweet [Twitter::Tweet]
def favorite(tweet) def favorite(tweet)
return if blacklisted?(tweet.user.screen_name)
log "Favoriting @#{tweet.user.screen_name}: #{tweet.text}" log "Favoriting @#{tweet.user.screen_name}: #{tweet.text}"
meta = calc_meta(tweet)
if !meta[:mentions_bot] && !userinfo(ev.user.screen_name).can_pester?
log "Not favoriting: leaving @#{ev.user.screen_name} alone"
end
begin begin
twitter.favorite(tweet.id) twitter.favorite(tweet.id)
rescue Twitter::Error::Forbidden rescue Twitter::Error::Forbidden
@ -303,8 +364,9 @@ module Ebooks
end end
end end
# Retweet a tweet
# @param tweet [Twitter::Tweet]
def retweet(tweet) def retweet(tweet)
return if blacklisted?(tweet.user.screen_name)
log "Retweeting @#{tweet.user.screen_name}: #{tweet.text}" log "Retweeting @#{tweet.user.screen_name}: #{tweet.text}"
begin begin
@ -314,21 +376,36 @@ module Ebooks
end end
end end
def follow(*args) # Follow a user
log "Following #{args}" # @param user [String] username or user id
twitter.follow(*args) def follow(user, *args)
log "Following #{user}"
twitter.follow(user, *args)
end end
def tweet(*args) # Unfollow a user
log "Tweeting #{args.inspect}" # @param user [String] username or user id
twitter.update(*args) def unfollow(user, *args)
log "Unfollowing #{user}"
twitter.unfollow(user, *args)
end end
# Tweet something
# @param text [String]
def tweet(text, *args)
log "Tweeting '#{text}'"
twitter.update(text, *args)
end
# Get a scheduler for this bot
# @return [Rufus::Scheduler]
def scheduler def scheduler
@scheduler ||= Rufus::Scheduler.new @scheduler ||= Rufus::Scheduler.new
end end
# could easily just be *args however the separation keeps it clean. # Tweet some text with an image
# @param txt [String]
# @param pic [String] filename
def pictweet(txt, pic, *args) def pictweet(txt, pic, *args)
log "Tweeting #{txt.inspect} - #{pic} #{args}" log "Tweeting #{txt.inspect} - #{pic} #{args}"
twitter.update_with_media(txt, File.new(pic), *args) twitter.update_with_media(txt, File.new(pic), *args)


@ -1,82 +0,0 @@
module Ebooks
# Special INTERIM token represents sentence boundaries
# This is so we can include start and end of statements in model
# Due to the way the sentence tokenizer works, can correspond
# to multiple actual parts of text (such as ^, $, \n and .?!)
INTERIM = :interim
# This is an ngram-based Markov model optimized to build from a
# tokenized sentence list without requiring too much transformation
class MarkovModel
def self.build(sentences)
MarkovModel.new.consume(sentences)
end
def consume(sentences)
# These models are of the form ngram => [[sentence_pos, token_pos] || INTERIM, ...]
# We map by both bigrams and unigrams so we can fall back to the latter in
# cases where an input bigram is unavailable, such as starting a sentence
@sentences = sentences
@unigrams = {}
@bigrams = {}
sentences.each_with_index do |tokens, i|
last_token = INTERIM
tokens.each_with_index do |token, j|
@unigrams[last_token] ||= []
@unigrams[last_token] << [i, j]
@bigrams[last_token] ||= {}
@bigrams[last_token][token] ||= []
if j == tokens.length-1 # Mark sentence endings
@unigrams[token] ||= []
@unigrams[token] << INTERIM
@bigrams[last_token][token] << INTERIM
else
@bigrams[last_token][token] << [i, j+1]
end
last_token = token
end
end
self
end
def find_token(index)
if index == INTERIM
INTERIM
else
@sentences[index[0]][index[1]]
end
end
def chain(tokens)
if tokens.length == 1
matches = @unigrams[tokens[-1]]
else
matches = @bigrams[tokens[-2]][tokens[-1]]
matches = @unigrams[tokens[-1]] if matches.length < 2
end
if matches.empty?
# This should never happen unless a strange token is
# supplied from outside the dataset
raise ArgumentError, "Unable to continue chain for: #{tokens.inspect}"
end
next_token = find_token(matches.sample)
if next_token == INTERIM # We chose to end the sentence
return tokens
else
return chain(tokens + [next_token])
end
end
def generate
NLP.reconstruct(chain([INTERIM]))
end
end
end
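The bigram-chain idea the removed MarkovModel implemented can be sketched in a few lines. This is a simplified sketch over plain token arrays, not the index-based `[sentence_pos, token_pos]` structure above.

```ruby
# Simplified bigram Markov chain: map each token to the tokens that can
# follow it, then walk the map from a start token until a sentence ends
def build_bigrams(sentences)
  bigrams = Hash.new { |h, k| h[k] = [] }
  sentences.each do |tokens|
    tokens.each_cons(2) { |a, b| bigrams[a] << b }
    bigrams[tokens[-1]] << nil # nil marks a sentence ending
  end
  bigrams
end

def chain(bigrams, start)
  out = [start]
  while (nxt = bigrams[out[-1]].sample)
    out << nxt
  end
  out
end

bigrams = build_bigrams([%w[the cat sat], %w[the dog ran]])
chain(bigrams, "the") # e.g. ["the", "cat", "sat"]
```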


@ -8,16 +8,41 @@ require 'csv'
module Ebooks module Ebooks
class Model class Model
attr_accessor :hash, :tokens, :sentences, :mentions, :keywords # @return [Array<String>]
# An array of unique tokens. This is the main source of actual strings
# in the model. Manipulation of a token is done using its index
# in this array, which we call a "tiki"
attr_accessor :tokens
def self.consume(txtpath) # @return [Array<Array<Integer>>]
Model.new.consume(txtpath) # Sentences represented by arrays of tikis
attr_accessor :sentences
# @return [Array<Array<Integer>>]
# Sentences derived from Twitter mentions
attr_accessor :mentions
# @return [Array<String>]
# The top 200 most important keywords, in descending order
attr_accessor :keywords
# Generate a new model from a corpus file
# @param path [String]
# @return [Ebooks::Model]
def self.consume(path)
Model.new.consume(path)
end end
# Generate a new model from multiple corpus files
# @param paths [Array<String>]
# @return [Ebooks::Model]
def self.consume_all(paths) def self.consume_all(paths)
Model.new.consume_all(paths) Model.new.consume_all(paths)
end end
# Load a saved model
# @param path [String]
# @return [Ebooks::Model]
def self.load(path) def self.load(path)
model = Model.new model = Model.new
model.instance_eval do model.instance_eval do
@ -30,6 +55,8 @@ module Ebooks
model model
end end
# Save model to a file
# @param path [String]
def save(path) def save(path)
File.open(path, 'wb') do |f| File.open(path, 'wb') do |f|
f.write(Marshal.dump({ f.write(Marshal.dump({
@ -43,19 +70,22 @@ module Ebooks
end end
def initialize def initialize
# This is the only source of actual strings in the model. It is
# an array of unique tokens. Manipulation of a token is mostly done
# using its index in this array, which we call a "tiki"
@tokens = [] @tokens = []
# Reverse lookup tiki by token, for faster generation # Reverse lookup tiki by token, for faster generation
@tikis = {} @tikis = {}
end end
# Reverse lookup a token index from a token
# @param token [String]
# @return [Integer]
def tikify(token) def tikify(token)
@tikis[token] or (@tokens << token and @tikis[token] = @tokens.length-1) @tikis[token] or (@tokens << token and @tikis[token] = @tokens.length-1)
end end
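The token-interning ("tiki") scheme described above can be sketched independently of the gem:

```ruby
# Interning tokens: each unique string gets a stable integer index
# ("tiki"), so sentences can be stored as arrays of small integers
tokens = []
tikis  = {}

tikify = lambda do |token|
  tikis[token] || (tokens << token; tikis[token] = tokens.length - 1)
end

a = tikify.call("hello")
b = tikify.call("world")
c = tikify.call("hello")
# a == 0, b == 1, and c == a: a repeated token reuses its index
```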
# Convert a body of text into arrays of tikis
# @param text [String]
# @return [Array<Array<Integer>>]
def mass_tikify(text) def mass_tikify(text)
sentences = NLP.sentences(text) sentences = NLP.sentences(text)
@ -69,9 +99,10 @@ module Ebooks
end end
end end
# Consume a corpus into this model
# @param path [String]
def consume(path) def consume(path)
content = File.read(path, :encoding => 'utf-8') content = File.read(path, :encoding => 'utf-8')
@hash = Digest::MD5.hexdigest(content)
if path.split('.')[-1] == "json" if path.split('.')[-1] == "json"
log "Reading json corpus from #{path}" log "Reading json corpus from #{path}"
@ -94,6 +125,8 @@ module Ebooks
consume_lines(lines) consume_lines(lines)
end end
# Consume a sequence of lines
# @param lines [Array<String>]
def consume_lines(lines) def consume_lines(lines)
log "Removing commented lines and sorting mentions" log "Removing commented lines and sorting mentions"
@ -126,11 +159,12 @@ module Ebooks
self self
end end
# Consume multiple corpuses into this model
# @param paths [Array<String>]
def consume_all(paths) def consume_all(paths)
lines = [] lines = []
paths.each do |path| paths.each do |path|
content = File.read(path, :encoding => 'utf-8') content = File.read(path, :encoding => 'utf-8')
@hash = Digest::MD5.hexdigest(content)
if path.split('.')[-1] == "json" if path.split('.')[-1] == "json"
log "Reading json corpus from #{path}" log "Reading json corpus from #{path}"
@ -156,25 +190,26 @@ module Ebooks
consume_lines(lines) consume_lines(lines)
end end
def fix(tweet) # Correct encoding issues in generated text
# This seems to require an external api call # @param text [String]
#begin # @return [String]
# fixer = NLP.gingerice.parse(tweet) def fix(text)
# log fixer if fixer['corrections'] NLP.htmlentities.decode text
# tweet = fixer['result']
#rescue Exception => e
# log e.message
# log e.backtrace
#end
NLP.htmlentities.decode tweet
end end
# Check if an array of tikis comprises a valid tweet
# @param tikis [Array<Integer>]
# @param limit [Integer] how many chars we have left
def valid_tweet?(tikis, limit) def valid_tweet?(tikis, limit)
tweet = NLP.reconstruct(tikis, @tokens) tweet = NLP.reconstruct(tikis, @tokens)
tweet.length <= limit && !NLP.unmatched_enclosers?(tweet) tweet.length <= limit && !NLP.unmatched_enclosers?(tweet)
end end
# Generate some text
# @param limit [Integer] available characters
# @param generator [SuffixGenerator, nil]
# @param retry_limit [Integer] how many times to retry on duplicates
# @return [String]
def make_statement(limit=140, generator=nil, retry_limit=10) def make_statement(limit=140, generator=nil, retry_limit=10)
responding = !generator.nil? responding = !generator.nil?
generator ||= SuffixGenerator.build(@sentences) generator ||= SuffixGenerator.build(@sentences)
@ -209,12 +244,17 @@ module Ebooks
end end
# Test if a sentence has been copied verbatim from original # Test if a sentence has been copied verbatim from original
def verbatim?(tokens) # @param tikis [Array<Integer>]
@sentences.include?(tokens) || @mentions.include?(tokens) # @return [Boolean]
def verbatim?(tikis)
@sentences.include?(tikis) || @mentions.include?(tikis)
end end
# Finds all relevant tokenized sentences to given input by # Finds relevant and slightly relevant tokenized sentences to input
# comparing non-stopword token overlaps # comparing non-stopword token overlaps
# @param sentences [Array<Array<Integer>>]
# @param input [String]
# @return [Array<Array<Array<Integer>>, Array<Array<Integer>>>]
def find_relevant(sentences, input) def find_relevant(sentences, input)
relevant = [] relevant = []
slightly_relevant = [] slightly_relevant = []
@ -235,6 +275,10 @@ module Ebooks
# Generates a response by looking for related sentences # Generates a response by looking for related sentences
# in the corpus and building a smaller generator from these # in the corpus and building a smaller generator from these
# @param input [String]
# @param limit [Integer] characters available for response
# @param sentences [Array<Array<Integer>>]
# @return [String]
def make_response(input, limit=140, sentences=@mentions) def make_response(input, limit=140, sentences=@mentions)
# Prefer mentions # Prefer mentions
relevant, slightly_relevant = find_relevant(sentences, input) relevant, slightly_relevant = find_relevant(sentences, input)


@ -12,31 +12,35 @@ module Ebooks
# Some of this stuff is pretty heavy and we don't necessarily need # Some of this stuff is pretty heavy and we don't necessarily need
# to be using it all of the time # to be using it all of the time
# Lazily loads an array of stopwords
# Stopwords are common English words that should often be ignored
# @return [Array<String>]
def self.stopwords def self.stopwords
@stopwords ||= File.read(File.join(DATA_PATH, 'stopwords.txt')).split @stopwords ||= File.read(File.join(DATA_PATH, 'stopwords.txt')).split
end end
# Lazily loads an array of known English nouns
# @return [Array<String>]
def self.nouns def self.nouns
@nouns ||= File.read(File.join(DATA_PATH, 'nouns.txt')).split @nouns ||= File.read(File.join(DATA_PATH, 'nouns.txt')).split
end end
# Lazily loads an array of known English adjectives
# @return [Array<String>]
def self.adjectives def self.adjectives
@adjectives ||= File.read(File.join(DATA_PATH, 'adjectives.txt')).split @adjectives ||= File.read(File.join(DATA_PATH, 'adjectives.txt')).split
end end
# POS tagger # Lazily load part-of-speech tagging library
# This can determine whether a word is being used as a noun/adjective/verb
# @return [EngTagger]
def self.tagger def self.tagger
require 'engtagger' require 'engtagger'
@tagger ||= EngTagger.new @tagger ||= EngTagger.new
end end
# Gingerice text correction service # Lazily load HTML entity decoder
def self.gingerice # @return [HTMLEntities]
require 'gingerice'
Gingerice::Parser.new # No caching for this one
end
# For decoding html entities
def self.htmlentities def self.htmlentities
require 'htmlentities' require 'htmlentities'
@htmlentities ||= HTMLEntities.new @htmlentities ||= HTMLEntities.new
@ -44,7 +48,9 @@ module Ebooks
### Utility functions ### Utility functions
# We don't really want to deal with all this weird unicode punctuation # Normalize some strange unicode punctuation variants
# @param text [String]
# @return [String]
def self.normalize(text) def self.normalize(text)
htmlentities.decode text.gsub('“', '"').gsub('”', '"').gsub('', "'").gsub('…', '...') htmlentities.decode text.gsub('“', '"').gsub('”', '"').gsub('', "'").gsub('…', '...')
end end
@ -53,6 +59,8 @@ module Ebooks
# We use ad hoc approach because fancy libraries do not deal # We use ad hoc approach because fancy libraries do not deal
# especially well with tweet formatting, and we can fake solving # especially well with tweet formatting, and we can fake solving
# the quote problem during generation # the quote problem during generation
# @param text [String]
# @return [Array<String>]
def self.sentences(text) def self.sentences(text)
text.split(/\n+|(?<=[.?!])\s+/) text.split(/\n+|(?<=[.?!])\s+/)
end end
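The ad hoc splitter above just breaks on newlines, or on whitespace that follows sentence-ending punctuation:

```ruby
# Split on runs of newlines, or on whitespace preceded by ., ? or !
def sentences(text)
  text.split(/\n+|(?<=[.?!])\s+/)
end

sentences("First one. Second one!\nThird line")
# => ["First one.", "Second one!", "Third line"]
```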
@ -60,15 +68,23 @@ module Ebooks
# Split a sentence into word-level tokens # Split a sentence into word-level tokens
# As above, this is ad hoc because tokenization libraries # As above, this is ad hoc because tokenization libraries
# do not behave well wrt. things like emoticons and timestamps # do not behave well wrt. things like emoticons and timestamps
# @param sentence [String]
# @return [Array<String>]
def self.tokenize(sentence) def self.tokenize(sentence)
regex = /\s+|(?<=[#{PUNCTUATION}]\s)(?=[a-zA-Z])|(?<=[a-zA-Z])(?=[#{PUNCTUATION}]+\s)/ regex = /\s+|(?<=[#{PUNCTUATION}]\s)(?=[a-zA-Z])|(?<=[a-zA-Z])(?=[#{PUNCTUATION}]+\s)/
sentence.split(regex) sentence.split(regex)
end end
# Get the 'stem' form of a word e.g. 'cats' -> 'cat'
# @param word [String]
# @return [String]
def self.stem(word) def self.stem(word)
Stemmer::stem_word(word.downcase) Stemmer::stem_word(word.downcase)
end end
# Use highscore gem to find interesting keywords in a corpus
# @param text [String]
# @return [Highscore::Keywords]
def self.keywords(text) def self.keywords(text)
# Preprocess to remove stopwords (highscore's blacklist is v. slow) # Preprocess to remove stopwords (highscore's blacklist is v. slow)
text = NLP.tokenize(text).reject { |t| stopword?(t) }.join(' ') text = NLP.tokenize(text).reject { |t| stopword?(t) }.join(' ')
@ -90,7 +106,10 @@ module Ebooks
text.keywords text.keywords
end end
# Takes a list of tokens and builds a nice-looking sentence # Builds a proper sentence from a list of tikis
# @param tikis [Array<Integer>]
# @param tokens [Array<String>]
# @return [String]
def self.reconstruct(tikis, tokens) def self.reconstruct(tikis, tokens)
text = "" text = ""
last_token = nil last_token = nil
@ -105,6 +124,9 @@ module Ebooks
end end
# Determine if we need to insert a space between two tokens # Determine if we need to insert a space between two tokens
# @param token1 [String]
# @param token2 [String]
# @return [Boolean]
def self.space_between?(token1, token2) def self.space_between?(token1, token2)
p1 = self.punctuation?(token1) p1 = self.punctuation?(token1)
p2 = self.punctuation?(token2) p2 = self.punctuation?(token2)
@ -119,10 +141,16 @@ module Ebooks
end end
end end
# Is this token comprised of punctuation?
# @param token [String]
# @return [Boolean]
def self.punctuation?(token) def self.punctuation?(token)
(token.chars.to_set - PUNCTUATION.chars.to_set).empty? (token.chars.to_set - PUNCTUATION.chars.to_set).empty?
end end
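The set-difference trick above can be tried standalone. `PUNCTUATION` here is an assumed stand-in for the gem's own constant:

```ruby
require 'set'

PUNCTUATION = ".?!,;:'\"" # assumed stand-in for the gem's constant

# A token counts as punctuation when every one of its characters
# belongs to the punctuation set
def punctuation?(token)
  (token.chars.to_set - PUNCTUATION.chars.to_set).empty?
end

punctuation?("?!")     # => true
punctuation?("don't")  # => false
```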
# Is this token a stopword?
# @param token [String]
# @return [Boolean]
def self.stopword?(token) def self.stopword?(token)
@stopword_set ||= stopwords.map(&:downcase).to_set @stopword_set ||= stopwords.map(&:downcase).to_set
@stopword_set.include?(token.downcase) @stopword_set.include?(token.downcase)
@ -130,7 +158,9 @@ module Ebooks
# Determine if a sample of text contains unmatched brackets or quotes # Determine if a sample of text contains unmatched brackets or quotes
# This is one of the more frequent and noticeable failure modes for # This is one of the more frequent and noticeable failure modes for
# the markov generator; we can just tell it to retry # the generator; we can just tell it to retry
# @param text [String]
# @return [Boolean]
def self.unmatched_enclosers?(text) def self.unmatched_enclosers?(text)
enclosers = ['**', '""', '()', '[]', '``', "''"] enclosers = ['**', '""', '()', '[]', '``', "''"]
enclosers.each do |pair| enclosers.each do |pair|
@ -153,10 +183,13 @@ module Ebooks
end end
# Determine if a2 is a subsequence of a1 # Determine if a2 is a subsequence of a1
# @param a1 [Array]
# @param a2 [Array]
# @return [Boolean]
def self.subseq?(a1, a2) def self.subseq?(a1, a2)
a1.each_index.find do |i| !a1.each_index.find do |i|
a1[i...i+a2.length] == a2 a1[i...i+a2.length] == a2
end end.nil?
end end
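The corrected subseq? above (the old version returned the found index or nil; the new one coerces that to a Boolean) behaves like this:

```ruby
# True when a2 appears as a contiguous run inside a1
def subseq?(a1, a2)
  !a1.each_index.find { |i| a1[i...i + a2.length] == a2 }.nil?
end

subseq?([1, 2, 3, 4], [2, 3]) # => true
subseq?([1, 2, 3, 4], [3, 2]) # => false
```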
end end
end end


@ -1,11 +1,14 @@
# encoding: utf-8 # encoding: utf-8
module Ebooks module Ebooks
# This generator uses data identical to the markov model, but # This generator uses data identical to a markov model, but
# instead of making a chain by looking up bigrams it uses the # instead of making a chain by looking up bigrams it uses the
# positions to randomly replace suffixes in one sentence with # positions to randomly replace suffixes in one sentence with
# matching suffixes in another # matching suffixes in another
class SuffixGenerator class SuffixGenerator
# Build a generator from a corpus of tikified sentences
# @param sentences [Array<Array<Integer>>]
# @return [SuffixGenerator]
def self.build(sentences) def self.build(sentences)
SuffixGenerator.new(sentences) SuffixGenerator.new(sentences)
end end
@ -39,6 +42,11 @@ module Ebooks
self self
end end
# Generate a recombined sequence of tikis
# @param passes [Integer] number of times to recombine
# @param n [Symbol] :unigrams or :bigrams (affects how conservative the model is)
# @return [Array<Integer>]
def generate(passes=5, n=:unigrams) def generate(passes=5, n=:unigrams)
index = rand(@sentences.length) index = rand(@sentences.length)
tikis = @sentences[index] tikis = @sentences[index]


@ -1,3 +1,3 @@
module Ebooks module Ebooks
VERSION = "2.3.2" VERSION = "3.0.0"
end end


@ -1,4 +1,4 @@
source 'http://rubygems.org' source 'http://rubygems.org'
ruby '1.9.3' ruby '{{RUBY_VERSION}}'
gem 'twitter_ebooks' gem 'twitter_ebooks'


@ -1 +1 @@
worker: ruby run.rb start worker: ebooks start

skeleton/bots.rb Executable file → Normal file

@ -1,42 +1,55 @@
#!/usr/bin/env ruby
require 'twitter_ebooks' require 'twitter_ebooks'
# This is an example bot definition with event handlers commented out # This is an example bot definition with event handlers commented out
# You can define as many of these as you like; they will run simultaneously # You can define and instantiate as many bots as you like
Ebooks::Bot.new("{{BOT_NAME}}") do |bot| class MyBot < Ebooks::Bot
# Consumer details come from registering an app at https://dev.twitter.com/ # Configuration here applies to all MyBots
# OAuth details can be fetched with https://github.com/marcel/twurl def configure
bot.consumer_key = "" # Your app consumer key # Consumer details come from registering an app at https://dev.twitter.com/
bot.consumer_secret = "" # Your app consumer secret # Once you have consumer details, use "ebooks auth" for new access tokens
bot.oauth_token = "" # Token connecting the app to this account self.consumer_key = '' # Your app consumer key
bot.oauth_token_secret = "" # Secret connecting the app to this account self.consumer_secret = '' # Your app consumer secret
bot.on_message do |dm| # Users to block instead of interacting with
self.blacklist = ['tnietzschequote']
# Range in seconds to randomize delay when bot.delay is called
self.delay_range = 1..6
end
def on_startup
scheduler.every '24h' do
# Tweet something every 24 hours
# See https://github.com/jmettraux/rufus-scheduler
# bot.tweet("hi")
# bot.pictweet("hi", "cuteselfie.jpg")
end
end
def on_message(dm)
# Reply to a DM # Reply to a DM
# bot.reply(dm, "secret secrets") # bot.reply(dm, "secret secrets")
end end
bot.on_follow do |user| def on_follow(user)
# Follow a user back # Follow a user back
# bot.follow(user[:screen_name]) # bot.follow(user[:screen_name])
end end
bot.on_mention do |tweet, meta| def on_mention(tweet)
# Reply to a mention # Reply to a mention
# bot.reply(tweet, meta[:reply_prefix] + "oh hullo") # bot.reply(tweet, meta(tweet).reply_prefix + "oh hullo")
end end
bot.on_timeline do |tweet, meta| def on_timeline(tweet)
# Reply to a tweet in the bot's timeline # Reply to a tweet in the bot's timeline
# bot.reply(tweet, meta[:reply_prefix] + "nice tweet") # bot.reply(tweet, meta(tweet).reply_prefix + "nice tweet")
end
bot.scheduler.every '24h' do
# Tweet something every 24 hours
# See https://github.com/jmettraux/rufus-scheduler
# bot.tweet("hi")
# bot.pictweet("hi", "cuteselfie.jpg", ":possibly_sensitive => true")
end end
end end
# Make a MyBot and attach it to an account
MyBot.new("{{BOT_NAME}}") do |bot|
bot.access_token = "" # Token connecting the app to this account
bot.access_token_secret = "" # Secret connecting the app to this account
end


@ -1,9 +0,0 @@
#!/usr/bin/env ruby
require_relative 'bots'
EM.run do
Ebooks::Bot.all.each do |bot|
bot.start
end
end


@ -3,13 +3,10 @@ require 'memory_profiler'
require 'tempfile' require 'tempfile'
require 'timecop' require 'timecop'
def Process.rss; `ps -o rss= -p #{Process.pid}`.chomp.to_i; end
class TestBot < Ebooks::Bot class TestBot < Ebooks::Bot
attr_accessor :twitter attr_accessor :twitter
def configure def configure
self.username = "test_ebooks"
end end
def on_direct_message(dm) def on_direct_message(dm)
@ -17,7 +14,7 @@ class TestBot < Ebooks::Bot
end end
def on_mention(tweet, meta) def on_mention(tweet, meta)
reply tweet, "echo: #{meta[:mentionless]}" reply tweet, "echo: #{meta.mentionless}"
end end
def on_timeline(tweet, meta) def on_timeline(tweet, meta)
@ -43,10 +40,11 @@ module Ebooks::Test
# Creates a mock tweet # Creates a mock tweet
# @param username User sending the tweet # @param username User sending the tweet
# @param text Tweet content # @param text Tweet content
def mock_tweet(username, text) def mock_tweet(username, text, extra={})
mentions = text.split.find_all { |x| x.start_with?('@') } mentions = text.split.find_all { |x| x.start_with?('@') }
Twitter::Tweet.new( tweet = Twitter::Tweet.new({
id: twitter_id, id: twitter_id,
in_reply_to_status_id: 'mock-link',
user: { id: twitter_id, screen_name: username }, user: { id: twitter_id, screen_name: username },
text: text, text: text,
created_at: Time.now.to_s, created_at: Time.now.to_s,
@ -56,29 +54,36 @@ module Ebooks::Test
indices: [text.index(m), text.index(m)+m.length] } indices: [text.index(m), text.index(m)+m.length] }
} }
} }
) }.merge!(extra))
tweet
end
def twitter_spy(bot)
twitter = spy("twitter")
allow(twitter).to receive(:update).and_return(mock_tweet(bot.username, "test tweet"))
twitter
end end
def simulate(bot, &b) def simulate(bot, &b)
bot.twitter = spy("twitter") bot.twitter = twitter_spy(bot)
b.call b.call
end end
def expect_direct_message(bot, content) def expect_direct_message(bot, content)
expect(bot.twitter).to have_received(:create_direct_message).with(anything(), content, {}) expect(bot.twitter).to have_received(:create_direct_message).with(anything(), content, {})
bot.twitter = spy("twitter") bot.twitter = twitter_spy(bot)
end end
def expect_tweet(bot, content) def expect_tweet(bot, content)
expect(bot.twitter).to have_received(:update).with(content, anything()) expect(bot.twitter).to have_received(:update).with(content, anything())
bot.twitter = spy("twitter") bot.twitter = twitter_spy(bot)
end end
end end
describe Ebooks::Bot do describe Ebooks::Bot do
include Ebooks::Test include Ebooks::Test
let(:bot) { TestBot.new } let(:bot) { TestBot.new('test_ebooks') }
before { Timecop.freeze } before { Timecop.freeze }
after { Timecop.return } after { Timecop.return }
@ -104,6 +109,20 @@ describe Ebooks::Bot do
end end
end end
it "links tweets to conversations correctly" do
tweet1 = mock_tweet("m1sp", "tweet 1", id: 1, in_reply_to_status_id: nil)
tweet2 = mock_tweet("m1sp", "tweet 2", id: 2, in_reply_to_status_id: 1)
tweet3 = mock_tweet("m1sp", "tweet 3", id: 3, in_reply_to_status_id: nil)
bot.conversation(tweet1).add(tweet1)
expect(bot.conversation(tweet2)).to eq(bot.conversation(tweet1))
bot.conversation(tweet2).add(tweet2)
expect(bot.conversation(tweet3)).to_not eq(bot.conversation(tweet2))
end
it "stops mentioning people after a certain limit" do it "stops mentioning people after a certain limit" do
simulate(bot) do simulate(bot) do
bot.receive_event(mock_tweet("spammer", "@test_ebooks @m1sp 1")) bot.receive_event(mock_tweet("spammer", "@test_ebooks @m1sp 1"))

File diff suppressed because it is too large


@ -1,18 +0,0 @@
#!/usr/bin/env ruby
# encoding: utf-8
require 'twitter_ebooks'
require 'minitest/autorun'
require 'benchmark'
module Ebooks
class TestKeywords < Minitest::Test
corpus = NLP.normalize(File.read(ARGV[0]))
puts "Finding and ranking keywords"
puts Benchmark.measure {
NLP.keywords(corpus).top(50).each do |keyword|
puts "#{keyword.text} #{keyword.weight}"
end
}
end
end


@ -1,18 +0,0 @@
#!/usr/bin/env ruby
# encoding: utf-8
require 'twitter_ebooks'
require 'minitest/autorun'
module Ebooks
class TestTokenize < Minitest::Test
corpus = NLP.normalize(File.read(TEST_CORPUS_PATH))
sents = NLP.sentences(corpus).sample(10)
NLP.sentences(corpus).sample(10).each do |sent|
p sent
p NLP.tokenize(sent)
puts
end
end
end


@ -18,8 +18,9 @@ Gem::Specification.new do |gem|
gem.add_development_dependency 'rspec' gem.add_development_dependency 'rspec'
gem.add_development_dependency 'rspec-mocks' gem.add_development_dependency 'rspec-mocks'
gem.add_development_dependency 'memory_profiler' gem.add_development_dependency 'memory_profiler'
gem.add_development_dependency 'pry-byebug'
gem.add_development_dependency 'timecop' gem.add_development_dependency 'timecop'
gem.add_development_dependency 'pry-byebug'
gem.add_development_dependency 'yard'
gem.add_runtime_dependency 'twitter', '~> 5.0' gem.add_runtime_dependency 'twitter', '~> 5.0'
gem.add_runtime_dependency 'simple_oauth' gem.add_runtime_dependency 'simple_oauth'
@ -30,4 +31,5 @@ Gem::Specification.new do |gem|
gem.add_runtime_dependency 'engtagger' gem.add_runtime_dependency 'engtagger'
gem.add_runtime_dependency 'fast-stemmer' gem.add_runtime_dependency 'fast-stemmer'
gem.add_runtime_dependency 'highscore' gem.add_runtime_dependency 'highscore'
gem.add_runtime_dependency 'pry'
end end