Merge remote-tracking branch 'mispy/master'

Eryn Wells 2017-05-07 14:38:19 +00:00
commit 5888d771e8
31 changed files with 216797 additions and 4 deletions

6
.gitignore vendored

@@ -1 +1,5 @@
corpus/
.*.swp
Gemfile.lock
pkg
.yardoc
doc

1
.rspec Normal file

@@ -0,0 +1 @@
--color

7
.travis.yml Normal file

@@ -0,0 +1,7 @@
rvm:
- 2.1.4
script:
- rspec spec
notifications:
email:
- ebooks@mispy.me

Gemfile

@@ -1,4 +1,4 @@
source 'http://rubygems.org'
ruby '2.2.0'
source 'https://rubygems.org'
gem 'twitter_ebooks'
# Specify your gem's dependencies in twitter_ebooks.gemspec
gemspec

22
LICENSE Normal file

@@ -0,0 +1,22 @@
Copyright (c) 2013 Jaiden Mispy
MIT License
Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:
The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

147
README.md Normal file

@@ -0,0 +1,147 @@
# twitter\_ebooks
[![Gem Version](https://badge.fury.io/rb/twitter_ebooks.svg)](http://badge.fury.io/rb/twitter_ebooks)
[![Build Status](https://travis-ci.org/mispy/twitter_ebooks.svg)](https://travis-ci.org/mispy/twitter_ebooks)
[![Dependency Status](https://gemnasium.com/mispy/twitter_ebooks.svg)](https://gemnasium.com/mispy/twitter_ebooks)
A framework for building interactive twitterbots which respond to mentions/DMs. See [ebooks_example](https://github.com/mispy/ebooks_example) for a fully-fledged bot definition.
## New in 3.0
- About 80% less memory and storage use for models
- Bots run in their own threads (no eventmachine), and startup is parallelized
- Bots start with `ebooks start`, and no longer die on unhandled exceptions
- `ebooks auth` command will create new access tokens, for running multiple bots
- `ebooks console` starts a ruby interpreter with bots loaded (see Ebooks::Bot.all)
- Replies are slightly rate-limited to prevent infinite bot convos
- Non-participating users in a mention chain will be dropped after a few tweets
- [API documentation](http://rdoc.info/github/mispy/twitter_ebooks) and tests
Note that 3.0 is not backwards compatible with 2.x, so upgrade carefully! In particular, **make sure to regenerate your models** since the storage format changed.
## Installation
Requires Ruby 2.0+
```bash
gem install twitter_ebooks
```
## Setting up a bot
Run `ebooks new <reponame>` to generate a new repository containing a sample bots.rb file, which looks like this:
``` ruby
# This is an example bot definition with event handlers commented out
# You can define and instantiate as many bots as you like
class MyBot < Ebooks::Bot
# Configuration here applies to all MyBots
def configure
# Consumer details come from registering an app at https://dev.twitter.com/
# Once you have consumer details, use "ebooks auth" for new access tokens
self.consumer_key = "" # Your app consumer key
self.consumer_secret = "" # Your app consumer secret
# Users to block instead of interacting with
self.blacklist = ['tnietzschequote']
# Range in seconds to randomize delay when bot.delay is called
self.delay_range = 1..6
end
def on_startup
scheduler.every '24h' do
# Tweet something every 24 hours
# See https://github.com/jmettraux/rufus-scheduler
# tweet("hi")
# pictweet("hi", "cuteselfie.jpg")
end
end
def on_message(dm)
# Reply to a DM
# reply(dm, "secret secrets")
end
def on_follow(user)
# Follow a user back
# follow(user.screen_name)
end
def on_mention(tweet)
# Reply to a mention
# reply(tweet, meta(tweet).reply_prefix + "oh hullo")
end
def on_timeline(tweet)
# Reply to a tweet in the bot's timeline
# reply(tweet, meta(tweet).reply_prefix + "nice tweet")
end
end
# Make a MyBot and attach it to an account
MyBot.new("abby_ebooks") do |bot|
bot.access_token = "" # Token connecting the app to this account
bot.access_token_secret = "" # Secret connecting the app to this account
end
```
`ebooks start` will run all defined bots in their own threads. The easiest way to run bots in a semi-permanent fashion is with [Heroku](https://www.heroku.com); just make an app, push the bot repository to it, enable a worker process in the web interface and it ought to chug along merrily forever.
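On Heroku, the worker process is typically declared in a `Procfile` at the repository root. A minimal sketch (assuming the gem's `ebooks` executable is available to the app via Bundler):

```
worker: ebooks start
```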
The underlying streaming and REST clients from the [twitter gem](https://github.com/sferik/twitter) can be accessed at `bot.stream` and `bot.twitter` respectively.
## Archiving accounts
twitter\_ebooks comes with a syncing tool to download and then incrementally update a local json archive of a user's tweets (in this case, my good friend @0xabad1dea):
``` zsh
➜ ebooks archive 0xabad1dea corpus/0xabad1dea.json
Currently 20209 tweets for 0xabad1dea
Received 67 new tweets
```
The first time you run this, it'll ask for auth details to connect with. Due to API limitations, for users with high numbers of tweets it may not be possible to get their entire history in the initial download. However, so long as you run it frequently enough you can maintain a perfect copy indefinitely into the future.
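The incremental update works by fetching only tweets newer than the newest archived id and prepending them to the archive. A simplified pure-Ruby sketch of that merge step (illustrative only, not the gem's actual `Ebooks::Archive` code):

``` ruby
# Illustrative sketch of incremental archive syncing: the newest archived
# tweet id acts as since_id, and freshly fetched tweets are prepended so
# the archive stays in reverse-chronological order.
def merge_archive(archived, fetched)
  newest_id = archived.first && archived.first[:id]
  fresh = newest_id ? fetched.select { |t| t[:id] > newest_id } : fetched
  fresh + archived
end

old = [{ id: 5, text: "old" }]
batch = [{ id: 7, text: "new" }, { id: 5, text: "old" }]
merge_archive(old, batch)
# => [{ id: 7, text: "new" }, { id: 5, text: "old" }]
```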
## Text models
In order to use the included text modeling, you'll first need to preprocess your archive into a more efficient form:
``` zsh
➜ ebooks consume corpus/0xabad1dea.json
Reading json corpus from corpus/0xabad1dea.json
Removing commented lines and sorting mentions
Segmenting text into sentences
Tokenizing 7075 statements and 17947 mentions
Ranking keywords
Corpus consumed to model/0xabad1dea.model
```
Notably, this works with both json tweet archives and plaintext files (based on file extension), so you can make a model out of any kind of text.
Text files use newlines and full stops to separate statements.
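A rough sketch of that segmentation rule in plain Ruby (illustrative only; the gem's real segmenter in `Ebooks::NLP` is more careful):

``` ruby
# Split plaintext into statements: first on newlines, then on full stops
# followed by whitespace. Purely an illustration of the rule above.
def split_statements(text)
  text.split("\n")
      .flat_map { |line| line.split(/(?<=\.)\s+/) }
      .map(&:strip)
      .reject(&:empty?)
end

split_statements("One thought. Another thought.\nA third line")
# => ["One thought.", "Another thought.", "A third line"]
```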
Once you have a model, the primary use is to produce statements and related responses to input, using a pseudo-Markov generator:
``` ruby
> model = Ebooks::Model.load("model/0xabad1dea.model")
> model.make_statement(140)
=> "My Terrible Netbook may be the kind of person who buys Starbucks, but this Rackspace vuln is pretty straight up a backdoor"
> model.make_response("The NSA is coming!", 130)
=> "Hey - someone who claims to be an NSA conspiracy"
```
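For intuition, the "pseudo-Markov" idea can be reduced to a toy bigram chain (a deliberately simplified sketch; the gem's actual model also does keyword ranking and response matching):

``` ruby
# Toy bigram chain: map each word to the words observed after it,
# then walk the chain from a seed word to generate a statement.
def build_chain(statements)
  chain = Hash.new { |h, k| h[k] = [] }
  statements.each do |s|
    s.split.each_cons(2) { |a, b| chain[a] << b }
  end
  chain
end

def make_statement(chain, start, max_words = 8, rng = Random.new)
  out = [start]
  (max_words - 1).times do
    nexts = chain[out.last]
    break if nexts.empty?
    out << nexts.sample(random: rng)
  end
  out.join(' ')
end

chain = build_chain(["the cat sat down", "the dog sat up"])
make_statement(chain, "the")
```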
The secondary function is the "interesting keywords" list. For example, I use this to determine whether a bot wants to fav/retweet/reply to something in its timeline:
``` ruby
top100 = model.keywords.take(100)
tokens = Ebooks::NLP.tokenize(tweet[:text])
if tokens.find { |t| top100.include?(t) }
bot.favorite(tweet[:id])
end
```
## Bot niceness
twitter_ebooks will drop bystanders from mentions for you and avoid infinite bot conversations, but it won't prevent you from doing a lot of other spammy things. Make sure your bot is a good and polite citizen!

2
Rakefile Normal file

@@ -0,0 +1,2 @@
#!/usr/bin/env rake
require "bundler/gem_tasks"

389
bin/ebooks Executable file

@@ -0,0 +1,389 @@
#!/usr/bin/env ruby
# encoding: utf-8
require 'twitter_ebooks'
require 'ostruct'
require 'fileutils'
module Ebooks::Util
def pretty_exception(e)
end
end
module Ebooks::CLI
APP_PATH = Dir.pwd # XXX do some recursive thing instead
HELP = OpenStruct.new
HELP.default = <<STR
Usage:
ebooks help <command>
ebooks new <reponame>
ebooks s[tart]
ebooks c[onsole]
ebooks auth
ebooks consume <corpus_path> [corpus_path2] [...]
ebooks consume-all <model_name> <corpus_path> [corpus_path2] [...]
ebooks gen <model_path> [input]
ebooks archive <username> [path]
ebooks tweet <model_path> <botname>
ebooks jsonify <tweets.csv> [tweets.csv2] [...]
STR
def self.help(command=nil)
if command.nil?
log HELP.default
else
log HELP[command].gsub(/^ {4}/, '')
end
end
HELP.new = <<-STR
Usage: ebooks new <reponame>
Creates a new skeleton repository defining a template bot,
in a directory named <reponame> under the current working directory.
STR
def self.new(reponame)
if reponame.nil?
help :new
exit 1
end
path = "./#{reponame}"
if File.exists?(path)
log "#{path} already exists. Please remove if you want to recreate."
exit 1
end
FileUtils.cp_r(Ebooks::SKELETON_PATH, path)
FileUtils.mv(File.join(path, 'gitignore'), File.join(path, '.gitignore'))
File.open(File.join(path, 'bots.rb'), 'w') do |f|
template = File.read(File.join(Ebooks::SKELETON_PATH, 'bots.rb'))
f.write(template.gsub("{{BOT_NAME}}", reponame))
end
File.open(File.join(path, 'Gemfile'), 'w') do |f|
template = File.read(File.join(Ebooks::SKELETON_PATH, 'Gemfile'))
f.write(template.gsub("{{RUBY_VERSION}}", RUBY_VERSION))
end
log "New twitter_ebooks app created at #{reponame}"
end
HELP.consume = <<-STR
Usage: ebooks consume <corpus_path> [corpus_path2] [...]
Processes some number of text files or json tweet corpuses
into usable models. These will be output at model/<corpus_name>.model
STR
def self.consume(paths)
if paths.empty?
help :consume
exit 1
end
paths.each do |path|
filename = File.basename(path)
shortname = filename.split('.')[0..-2].join('.')
outpath = File.join(APP_PATH, 'model', "#{shortname}.model")
Ebooks::Model.consume(path).save(outpath)
log "Corpus consumed to #{outpath}"
end
end
HELP.consume_all = <<-STR
Usage: ebooks consume-all <model_name> <corpus_path> [corpus_path2] [...]
Processes some number of text files or json tweet corpuses
into one usable model. It will be output at model/<model_name>.model
STR
def self.consume_all(name, paths)
if paths.empty?
help :consume_all
exit 1
end
outpath = File.join(APP_PATH, 'model', "#{name}.model")
Ebooks::Model.consume_all(paths).save(outpath)
log "Corpuses consumed to #{outpath}"
end
HELP.jsonify = <<-STR
Usage: ebooks jsonify <tweets.csv> [tweets.csv2] [...]
Takes a csv twitter archive and converts it to json.
STR
def self.jsonify(paths)
if paths.empty?
help :jsonify
exit 1
end
paths.each do |path|
name = File.basename(path).split('.')[0]
new_path = name + ".json"
tweets = []
id = nil
if path.split('.')[-1] == "csv" #from twitter archive
csv_archive = CSV.read(path, :headers=>:first_row)
tweets = csv_archive.map do |tweet|
{ text: tweet['text'], id: tweet['tweet_id'] }
end
else
File.read(path).split("\n").each do |l|
if l.start_with?('# ')
id = l.split('# ')[-1]
else
tweet = { text: l }
if id
tweet[:id] = id
id = nil
end
tweets << tweet
end
end
end
File.open(new_path, 'w') do |f|
log "Writing #{tweets.length} tweets to #{new_path}"
f.write(JSON.pretty_generate(tweets))
end
end
end
HELP.gen = <<-STR
Usage: ebooks gen <model_path> [input]
Make a test tweet from the processed model at <model_path>.
Will respond to input if provided.
STR
def self.gen(model_path, input)
if model_path.nil?
help :gen
exit 1
end
model = Ebooks::Model.load(model_path)
if input && !input.empty?
puts "@cmd " + model.make_response(input, 135)
else
puts model.make_statement
end
end
HELP.archive = <<-STR
Usage: ebooks archive <username> [outpath]
Downloads a json corpus of the <username>'s tweets.
Output defaults to corpus/<username>.json
Due to API limitations, this can only receive up to ~3000 tweets
into the past.
STR
def self.archive(username, outpath=nil)
if username.nil?
help :archive
exit 1
end
Ebooks::Archive.new(username, outpath).sync
end
HELP.tweet = <<-STR
Usage: ebooks tweet <model_path> <botname>
Sends a public tweet from the specified bot using text
from the processed model at <model_path>.
STR
def self.tweet(modelpath, botname)
if modelpath.nil? || botname.nil?
help :tweet
exit 1
end
load File.join(APP_PATH, 'bots.rb')
model = Ebooks::Model.load(modelpath)
statement = model.make_statement
bot = Ebooks::Bot.get(botname)
bot.configure
bot.tweet(statement)
end
HELP.auth = <<-STR
Usage: ebooks auth
Authenticates your Twitter app for any account. By default, will
use the consumer key and secret from the first defined bot. You
can specify another by setting the CONSUMER_KEY and CONSUMER_SECRET
environment variables.
STR
def self.auth
consumer_key, consumer_secret = find_consumer
require 'oauth'
consumer = OAuth::Consumer.new(
consumer_key,
consumer_secret,
site: 'https://twitter.com/',
scheme: :header
)
request_token = consumer.get_request_token
auth_url = request_token.authorize_url()
pin = nil
loop do
log auth_url
log "Go to the above url and follow the prompts, then enter the PIN code here."
print "> "
pin = STDIN.gets.chomp
break unless pin.empty?
end
access_token = request_token.get_access_token(oauth_verifier: pin)
log "Account authorized successfully. Make sure to put these in your bots.rb!\n" +
" access token: #{access_token.token}\n" +
" access token secret: #{access_token.secret}"
end
HELP.console = <<-STR
Usage: ebooks c[onsole]
Starts an interactive ruby session with your bots loaded
and configured.
STR
def self.console
load_bots
require 'pry'; Ebooks.module_exec { pry }
end
HELP.start = <<-STR
Usage: ebooks s[tart] [botname]
Starts running bots. If botname is provided, only runs that bot.
STR
def self.start(botname=nil)
load_bots
if botname.nil?
bots = Ebooks::Bot.all
else
bots = Ebooks::Bot.all.select { |bot| bot.username == botname }
if bots.empty?
log "Couldn't find a defined bot for @#{botname}!"
exit 1
end
end
threads = []
bots.each do |bot|
threads << Thread.new { bot.prepare }
end
threads.each(&:join)
threads = []
bots.each do |bot|
threads << Thread.new do
loop do
begin
bot.start
rescue Exception => e
bot.log e.inspect
puts e.backtrace.map { |s| "\t"+s }.join("\n")
end
bot.log "Sleeping before reconnect"
sleep 60
end
end
end
threads.each(&:join)
end
# Non-command methods
def self.find_consumer
if ENV['CONSUMER_KEY'] && ENV['CONSUMER_SECRET']
log "Using consumer details from environment variables:\n" +
" consumer key: #{ENV['CONSUMER_KEY']}\n" +
" consumer secret: #{ENV['CONSUMER_SECRET']}"
return [ENV['CONSUMER_KEY'], ENV['CONSUMER_SECRET']]
end
load_bots
consumer_key = nil
consumer_secret = nil
Ebooks::Bot.all.each do |bot|
if bot.consumer_key && bot.consumer_secret
consumer_key = bot.consumer_key
consumer_secret = bot.consumer_secret
log "Using consumer details from @#{bot.username}:\n" +
" consumer key: #{bot.consumer_key}\n" +
" consumer secret: #{bot.consumer_secret}\n"
return consumer_key, consumer_secret
end
end
if consumer_key.nil? || consumer_secret.nil?
log "Couldn't find any consumer details to auth an account with.\n" +
"Please either configure a bot with consumer_key and consumer_secret\n" +
"or provide the CONSUMER_KEY and CONSUMER_SECRET environment variables."
exit 1
end
end
def self.load_bots
load 'bots.rb'
if Ebooks::Bot.all.empty?
puts "Couldn't find any bots! Please make sure bots.rb instantiates at least one bot."
end
end
def self.command(args)
if args.length == 0
help
exit 1
end
case args[0]
when "new" then new(args[1])
when "consume" then consume(args[1..-1])
when "consume-all" then consume_all(args[1], args[2..-1])
when "gen" then gen(args[1], args[2..-1].join(' '))
when "archive" then archive(args[1], args[2])
when "tweet" then tweet(args[1], args[2])
when "jsonify" then jsonify(args[1..-1])
when "auth" then auth
when "console" then console
when "c" then console
when "start" then start(args[1])
when "s" then start(args[1])
when "help" then help(args[1])
else
log "No such command '#{args[0]}'"
help
exit 1
end
end
end
Ebooks::CLI.command(ARGV)

1466
data/adjectives.txt Normal file

File diff suppressed because it is too large

2193
data/nouns.txt Normal file

File diff suppressed because it is too large

843
data/stopwords.txt Normal file

@@ -0,0 +1,843 @@
a
able
about
above
abst
accordance
according
accordingly
across
act
actually
added
adj
affected
affecting
affects
after
afterwards
again
against
ah
all
almost
alone
along
already
also
although
always
am
among
amongst
an
and
announce
another
any
anybody
anyhow
anymore
anyone
anything
anyway
anyways
anywhere
apparently
approximately
are
aren
arent
arise
around
as
aside
ask
asking
at
auth
available
away
awfully
b
back
be
became
because
become
becomes
becoming
been
before
beforehand
begin
beginning
beginnings
begins
behind
being
believe
below
beside
besides
between
beyond
biol
both
brief
briefly
but
by
c
ca
came
can
cannot
can't
cause
causes
certain
certainly
co
com
come
comes
contain
containing
contains
could
couldnt
d
date
did
didn't
different
do
does
doesn't
doing
done
don't
down
downwards
due
during
e
each
ed
edu
effect
eg
eight
eighty
either
else
elsewhere
end
ending
enough
especially
et
et-al
etc
even
ever
every
everybody
everyone
everything
everywhere
ex
except
f
far
few
ff
fifth
first
five
fix
followed
following
follows
for
former
formerly
forth
found
four
from
further
furthermore
g
gave
get
gets
getting
give
given
gives
giving
go
goes
gone
got
gotten
h
had
happens
hardly
has
hasn't
have
haven't
having
he
hed
hence
her
here
hereafter
hereby
herein
heres
hereupon
hers
herself
hes
hi
hid
him
himself
his
hither
home
how
howbeit
however
hundred
i
id
ie
if
i'll
im
immediate
immediately
importance
important
in
inc
indeed
index
information
instead
into
invention
inward
is
isn't
it
itd
it'll
its
itself
i've
j
just
k
keep
keeps
kept
kg
km
know
known
knows
l
largely
last
lately
later
latter
latterly
least
less
lest
let
lets
like
liked
likely
line
little
'll
look
looking
looks
ltd
m
made
mainly
make
makes
many
may
maybe
me
mean
means
meantime
meanwhile
merely
mg
might
million
miss
ml
more
moreover
most
mostly
mr
mrs
much
mug
must
my
myself
n
na
name
namely
nay
nd
near
nearly
necessarily
necessary
need
needs
neither
never
nevertheless
new
next
nine
ninety
no
nobody
non
none
nonetheless
noone
nor
normally
nos
not
noted
nothing
now
nowhere
o
obtain
obtained
obviously
of
off
often
oh
ok
okay
old
omitted
on
once
one
ones
only
onto
or
ord
other
others
otherwise
ought
our
ours
ourselves
out
outside
over
overall
owing
own
p
page
pages
part
particular
particularly
past
per
perhaps
placed
please
plus
poorly
possible
possibly
potentially
pp
predominantly
present
previously
primarily
probably
promptly
proud
provides
put
q
que
quickly
quite
qv
r
ran
rather
rd
re
readily
really
recent
recently
ref
refs
regarding
regardless
regards
related
relatively
research
respectively
resulted
resulting
results
right
run
s
said
same
saw
say
saying
says
sec
section
see
seeing
seem
seemed
seeming
seems
seen
self
selves
sent
seven
several
shall
she
shed
she'll
shes
should
shouldn't
show
showed
shown
showns
shows
significant
significantly
similar
similarly
since
six
slightly
so
some
somebody
somehow
someone
somethan
something
sometime
sometimes
somewhat
somewhere
soon
sorry
specifically
specified
specify
specifying
still
stop
strongly
sub
substantially
successfully
such
sufficiently
suggest
sup
sure
t
take
taken
taking
tell
tends
th
than
thank
thanks
thanx
that
that'll
thats
that've
the
their
theirs
them
themselves
then
thence
there
thereafter
thereby
thered
therefore
therein
there'll
thereof
therere
theres
thereto
thereupon
there've
these
they
theyd
they'll
theyre
they've
think
this
those
thou
though
thoughh
thousand
throug
through
throughout
thru
thus
til
tip
to
together
too
took
toward
towards
tried
tries
truly
try
trying
ts
twice
two
u
un
under
unfortunately
unless
unlike
unlikely
until
unto
up
upon
ups
us
use
used
useful
usefully
usefulness
uses
using
usually
v
value
various
've
very
via
viz
vol
vols
vs
w
want
wants
was
wasn't
way
we
wed
welcome
we'll
went
were
weren't
we've
what
whatever
what'll
whats
when
whence
whenever
where
whereafter
whereas
whereby
wherein
wheres
whereupon
wherever
whether
which
while
whim
whither
who
whod
whoever
whole
who'll
whom
whomever
whos
whose
why
widely
willing
wish
with
within
without
won't
words
world
would
wouldn't
www
x
y
yes
yet
you
youd
you'll
your
youre
yours
yourself
yourselves
you've
z
zero
.
?
!
http
don
people
well
will
https
time
good
thing
twitter
pretty
it's
i'm
that's
you're
they're
there's
things
yeah
find
going
work
point
years
guess
bad
problem
real
kind
day
better
lot
stuff
i'd
read
thought
idea
case
word
hey
person
long
Dear
internet
tweet
he's
feel
wrong
call
hard
phone
ago
literally
remember
reason
called
course
bit
question
high
today
told
man
actual
year
three
book
assume
life
true
best
wow
video
times
works
fact
completely
totally
imo
open
lol
haha
cool
yep
ooh
great
ugh
tonight
talk
sounds
hahaha
whoa
cool
we're
guys
sweet
fortunately
hmm
aren't
sadly
talking
you'd
place
yup
what's
y'know
basically
god
shit
holy
interesting
news
guy
wait
oooh
gonna
current
let's
tomorrow
omg
hate
hope
fuck
oops
night
wear
wanna
fun
finally
whoops
nevermind
definitely
context
screen
free
exactly
big
house
half
working
play
heard
hmmm
damn
woah
tho
set
idk
sort
understand
kinda
seriously
btw
she's
hah
aww
ffs
it'd
that'd
hopefully
non
entirely
lots
entire
tend
hullo
clearly
surely
weird
start
help
nope

21
lib/twitter_ebooks.rb Normal file

@@ -0,0 +1,21 @@
$debug = false
def log(*args)
STDERR.print args.map(&:to_s).join(' ') + "\n"
STDERR.flush
end
module Ebooks
GEM_PATH = File.expand_path(File.join(File.dirname(__FILE__), '..'))
DATA_PATH = File.join(GEM_PATH, 'data')
SKELETON_PATH = File.join(GEM_PATH, 'skeleton')
TEST_PATH = File.join(GEM_PATH, 'test')
TEST_CORPUS_PATH = File.join(TEST_PATH, 'corpus/0xabad1dea.tweets')
INTERIM = :interim
end
require 'twitter_ebooks/nlp'
require 'twitter_ebooks/archive'
require 'twitter_ebooks/suffix'
require 'twitter_ebooks/model'
require 'twitter_ebooks/bot'

102
lib/twitter_ebooks/archive.rb Normal file

@@ -0,0 +1,102 @@
#!/usr/bin/env ruby
# encoding: utf-8
require 'twitter'
require 'json'
CONFIG_PATH = "#{ENV['HOME']}/.ebooksrc"
module Ebooks
class Archive
attr_reader :tweets
def make_client
if File.exists?(CONFIG_PATH)
@config = JSON.parse(File.read(CONFIG_PATH), symbolize_names: true)
else
@config = {}
puts "As Twitter no longer allows anonymous API access, you'll need to enter the auth details of any account to use for archiving. These will be stored in #{CONFIG_PATH} if you need to change them later."
print "Consumer key: "
@config[:consumer_key] = STDIN.gets.chomp
print "Consumer secret: "
@config[:consumer_secret] = STDIN.gets.chomp
print "Access token: "
@config[:oauth_token] = STDIN.gets.chomp
print "Access secret: "
@config[:oauth_token_secret] = STDIN.gets.chomp
File.open(CONFIG_PATH, 'w') do |f|
f.write(JSON.pretty_generate(@config))
end
end
Twitter::REST::Client.new do |config|
config.consumer_key = @config[:consumer_key]
config.consumer_secret = @config[:consumer_secret]
config.access_token = @config[:oauth_token]
config.access_token_secret = @config[:oauth_token_secret]
end
end
def initialize(username, path=nil, client=nil)
@username = username
@path = path || "corpus/#{username}.json"
if File.directory?(@path)
@path = File.join(@path, "#{username}.json")
end
@client = client || make_client
if File.exists?(@path)
@tweets = JSON.parse(File.read(@path, :encoding => 'utf-8'), symbolize_names: true)
log "Currently #{@tweets.length} tweets for #{@username}"
else
@tweets = nil
log "New archive for @#{username} at #{@path}"
end
end
def sync
retries = 0
tweets = []
max_id = nil
opts = {
count: 200,
#include_rts: false,
trim_user: true
}
opts[:since_id] = @tweets[0][:id] unless @tweets.nil?
loop do
opts[:max_id] = max_id unless max_id.nil?
begin
new = @client.user_timeline(@username, opts)
rescue Twitter::Error::TooManyRequests
log "Rate limit exceeded. Waiting for 5 mins before retry."
sleep 60*5
retry
end
break if new.length <= 1
tweets += new
log "Received #{tweets.length} new tweets"
max_id = new.last.id
end
if tweets.length == 0
log "No new tweets"
else
@tweets ||= []
@tweets = tweets.map(&:attrs).each { |tw|
tw.delete(:entities)
} + @tweets
File.open(@path, 'w') do |f|
f.write(JSON.pretty_generate(@tweets))
end
end
end
end
end

469
lib/twitter_ebooks/bot.rb Normal file

@@ -0,0 +1,469 @@
# encoding: utf-8
require 'twitter'
require 'rufus/scheduler'
module Ebooks
class ConfigurationError < Exception
end
# Represents a single reply tree of tweets
class Conversation
attr_reader :last_update
# @param bot [Ebooks::Bot]
def initialize(bot)
@bot = bot
@tweets = []
@last_update = Time.now
end
# @param tweet [Twitter::Tweet] tweet to add
def add(tweet)
@tweets << tweet
@last_update = Time.now
end
# Make an informed guess as to whether a user is a bot based
# on their behavior in this conversation
def is_bot?(username)
usertweets = @tweets.select { |t| t.user.screen_name.downcase == username.downcase }
if usertweets.length > 2
if (usertweets[-1].created_at - usertweets[-3].created_at) < 10
return true
end
end
username.include?("ebooks")
end
# Figure out whether to keep this user in the reply prefix
# We want to avoid spamming non-participating users
def can_include?(username)
@tweets.length <= 4 ||
!@tweets.select { |t| t.user.screen_name.downcase == username.downcase }.empty?
end
end
# Meta information about a tweet that we calculate for ourselves
class TweetMeta
# @return [Array<String>] usernames mentioned in tweet
attr_accessor :mentions
# @return [String] text of tweets with mentions removed
attr_accessor :mentionless
# @return [Array<String>] usernames to include in a reply
attr_accessor :reply_mentions
# @return [String] mentions to start reply with
attr_accessor :reply_prefix
# @return [Integer] available chars for reply
attr_accessor :limit
# @return [Ebooks::Bot] associated bot
attr_accessor :bot
# @return [Twitter::Tweet] associated tweet
attr_accessor :tweet
# Check whether this tweet mentions our bot
# @return [Boolean]
def mentions_bot?
# To check if this is someone talking to us, ensure:
# - The tweet mentions list contains our username
# - The tweet is not being retweeted by somebody else
# - Or soft-retweeted by somebody else
@mentions.map(&:downcase).include?(@bot.username.downcase) && !@tweet.retweeted_status? && !@tweet.text.match(/([`'"“”]|RT|via|by|from)\s*@/i)
end
# @param bot [Ebooks::Bot]
# @param ev [Twitter::Tweet]
def initialize(bot, ev)
@bot = bot
@tweet = ev
@mentions = ev.attrs[:entities][:user_mentions].map { |x| x[:screen_name] }
# Process mentions to figure out who to reply to
# i.e. not self and nobody who has seen too many secondary mentions
reply_mentions = @mentions.reject do |m|
m.downcase == @bot.username.downcase || !@bot.conversation(ev).can_include?(m)
end
@reply_mentions = ([ev.user.screen_name] + reply_mentions).uniq
@reply_prefix = @reply_mentions.map { |m| '@'+m }.join(' ') + ' '
@limit = 140 - @reply_prefix.length
mless = ev.text
begin
ev.attrs[:entities][:user_mentions].reverse.each do |entity|
last = mless[entity[:indices][1]..-1]||''
mless = mless[0...entity[:indices][0]] + last.strip
end
rescue Exception
p ev.attrs[:entities][:user_mentions]
p ev.text
raise
end
@mentionless = mless
end
# Get an array of media uris in tweet.
# @param size [String] A twitter image size to return. Supported sizes are thumb, small, medium (default), large
# @return [Array<String>] image URIs included in tweet
def media_uris(size_input = '')
case size_input
when 'thumb'
size = ':thumb'
when 'small'
size = ':small'
when 'medium'
size = ':medium'
when 'large'
size = ':large'
else
size = ''
end
# Start collecting uris.
uris = []
if @tweet.media?
@tweet.media.each do |each_media|
uris << each_media.media_url.to_s + size
end
end
# and that's pretty much it!
uris
end
end
class Bot
# @return [String] OAuth consumer key for a Twitter app
attr_accessor :consumer_key
# @return [String] OAuth consumer secret for a Twitter app
attr_accessor :consumer_secret
# @return [String] OAuth access token from `ebooks auth`
attr_accessor :access_token
# @return [String] OAuth access secret from `ebooks auth`
attr_accessor :access_token_secret
# @return [Twitter::User] Twitter user object of bot
attr_accessor :user
# @return [String] Twitter username of bot
attr_accessor :username
# @return [Array<String>] list of usernames to block on contact
attr_accessor :blacklist
# @return [Hash{String => Ebooks::Conversation}] maps tweet ids to their conversation contexts
attr_accessor :conversations
# @return [Range, Integer] range of seconds to delay in delay method
attr_accessor :delay_range
# @return [Array] list of all defined bots
def self.all; @@all ||= []; end
# Fetches a bot by username
# @param username [String]
# @return [Ebooks::Bot]
def self.get(username)
all.find { |bot| bot.username == username }
end
# Logs info to stdout in the context of this bot
def log(*args)
STDOUT.print "@#{@username}: " + args.map(&:to_s).join(' ') + "\n"
STDOUT.flush
end
# Initializes and configures bot
# @param args Arguments passed to configure method
# @param b Block to call with new bot
def initialize(username, &b)
@blacklist ||= []
@conversations ||= {}
# Tweet ids we've already observed, to avoid duplication
@seen_tweets ||= {}
@username = username
@delay_range ||= 1..6
configure
b.call(self) unless b.nil?
Bot.all << self
end
def configure
raise ConfigurationError, "Please override the 'configure' method for subclasses of Ebooks::Bot."
end
# Find or create the conversation context for this tweet
# @param tweet [Twitter::Tweet]
# @return [Ebooks::Conversation]
def conversation(tweet)
conv = if tweet.in_reply_to_status_id?
@conversations[tweet.in_reply_to_status_id]
end
if conv.nil?
conv = @conversations[tweet.id] || Conversation.new(self)
end
if tweet.in_reply_to_status_id?
@conversations[tweet.in_reply_to_status_id] = conv
end
@conversations[tweet.id] = conv
# Expire any old conversations to prevent memory growth
@conversations.each do |k,v|
if v != conv && Time.now - v.last_update > 3600
@conversations.delete(k)
end
end
conv
end
# @return [Twitter::REST::Client] underlying REST client from twitter gem
def twitter
@twitter ||= Twitter::REST::Client.new do |config|
config.consumer_key = @consumer_key
config.consumer_secret = @consumer_secret
config.access_token = @access_token
config.access_token_secret = @access_token_secret
end
end
# @return [Twitter::Streaming::Client] underlying streaming client from twitter gem
def stream
@stream ||= Twitter::Streaming::Client.new do |config|
config.consumer_key = @consumer_key
config.consumer_secret = @consumer_secret
config.access_token = @access_token
config.access_token_secret = @access_token_secret
end
end
# Calculate some meta information about a tweet relevant for replying
# @param ev [Twitter::Tweet]
# @return [Ebooks::TweetMeta]
def meta(ev)
TweetMeta.new(self, ev)
end
# Receive an event from the twitter stream
# @param ev [Object] Twitter streaming event
def receive_event(ev)
case ev
when Array # Initial array sent on first connection
log "Online!"
fire(:connect, ev)
return
when Twitter::DirectMessage
return if ev.sender.id == @user.id # Don't reply to self
log "DM from @#{ev.sender.screen_name}: #{ev.text}"
fire(:message, ev)
when Twitter::Tweet
return unless ev.text # If it's not a text-containing tweet, ignore it
return if ev.user.id == @user.id # Ignore our own tweets
meta = meta(ev)
if blacklisted?(ev.user.screen_name)
log "Blocking blacklisted user @#{ev.user.screen_name}"
@twitter.block(ev.user.screen_name)
end
# Avoid responding to duplicate tweets
if @seen_tweets[ev.id]
log "Not firing event for duplicate tweet #{ev.id}"
return
else
@seen_tweets[ev.id] = true
end
if meta.mentions_bot?
log "Mention from @#{ev.user.screen_name}: #{ev.text}"
conversation(ev).add(ev)
fire(:mention, ev)
else
fire(:timeline, ev)
end
when Twitter::Streaming::Event
case ev.name
when :follow
return if ev.source.id == @user.id
log "Followed by #{ev.source.screen_name}"
fire(:follow, ev.source)
when :favorite, :unfavorite
return if ev.source.id == @user.id # Ignore our own favorites
log "@#{ev.source.screen_name} #{ev.name.to_s}d: #{ev.target_object.text}"
fire(ev.name, ev.source, ev.target_object)
when :user_update
update_myself ev.source
end
when Twitter::Streaming::DeletedTweet
# Pass
else
log ev
end
end
# Updates @user and calls on_user_update.
def update_myself(new_me=twitter.user)
@user = new_me if @user.nil? || new_me.id == @user.id
@username = @user.screen_name
log 'User information updated'
fire(:user_update)
end
# Configures client and fires startup event
def prepare
# Sanity check
if @username.nil?
raise ConfigurationError, "bot username cannot be nil"
end
if @consumer_key.nil? || @consumer_key.empty? ||
@consumer_secret.nil? || @consumer_secret.empty?
log "Missing consumer_key or consumer_secret. These details can be acquired by registering a Twitter app at https://apps.twitter.com/"
exit 1
end
if @access_token.nil? || @access_token.empty? ||
@access_token_secret.nil? || @access_token_secret.empty?
log "Missing access_token or access_token_secret. Please run `ebooks auth`."
exit 1
end
# Save old name
old_name = username
# Load user object and actual username
update_myself
# Warn about mismatches unless it was clearly intentional
log "warning: bot expected to be @#{old_name} but connected to @#{username}" unless username == old_name || old_name.empty?
fire(:startup)
end
# Start running user event stream
def start
log "starting tweet stream"
stream.user do |ev|
receive_event ev
end
end
# Fire an event
# @param event [Symbol] event to fire
# @param args arguments for event handler
def fire(event, *args)
handler = "on_#{event}".to_sym
if respond_to? handler
self.send(handler, *args)
end
end
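`fire` is plain dynamic dispatch: an event name like `:follow` becomes a call to `on_follow` when the object defines one, and is silently ignored otherwise. A minimal standalone sketch of the same pattern (the `MiniBot` class and its handler are hypothetical, not part of the gem):

```ruby
class MiniBot
  # Mirror of Bot#fire: build the handler name, call it only if defined
  def fire(event, *args)
    handler = "on_#{event}".to_sym
    send(handler, *args) if respond_to?(handler)
  end

  def on_follow(user)
    "followed back #{user}"
  end
end

bot = MiniBot.new
bot.fire(:follow, "@m1sp")  # => "followed back @m1sp"
bot.fire(:unknown)          # => nil (no handler defined, silently ignored)
```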
# Delay an action for a variable period of time
# @param range [Range, Integer] range of seconds to choose for delay
def delay(range=@delay_range, &b)
time = range.is_a?(Integer) ? range : range.to_a.sample
sleep time
b.call
end
# Check if a username is blacklisted
# @param username [String]
# @return [Boolean]
def blacklisted?(username)
@blacklist.map(&:downcase).include?(username.downcase)
end
# Reply to a tweet or a DM.
# @param ev [Twitter::Tweet, Twitter::DirectMessage]
# @param text [String] contents of reply excluding reply_prefix
# @param opts [Hash] additional params to pass to twitter gem
def reply(ev, text, opts={})
opts = opts.clone
if ev.is_a? Twitter::DirectMessage
log "Sending DM to @#{ev.sender.screen_name}: #{text}"
twitter.create_direct_message(ev.sender.screen_name, text, opts)
elsif ev.is_a? Twitter::Tweet
meta = meta(ev)
if conversation(ev).is_bot?(ev.user.screen_name)
log "Not replying to suspected bot @#{ev.user.screen_name}"
return false
end
text = meta.reply_prefix + text unless text.match(/@#{Regexp.escape ev.user.screen_name}/i)
log "Replying to @#{ev.user.screen_name} with: #{text}"
tweet = twitter.update(text, opts.merge(in_reply_to_status_id: ev.id))
conversation(tweet).add(tweet)
tweet
else
raise ArgumentError, "Don't know how to reply to a #{ev.class}"
end
end
# Favorite a tweet
# @param tweet [Twitter::Tweet]
def favorite(tweet)
log "Favoriting @#{tweet.user.screen_name}: #{tweet.text}"
begin
twitter.favorite(tweet.id)
rescue Twitter::Error::Forbidden
log "Already favorited: #{tweet.user.screen_name}: #{tweet.text}"
end
end
# Retweet a tweet
# @param tweet [Twitter::Tweet]
def retweet(tweet)
log "Retweeting @#{tweet.user.screen_name}: #{tweet.text}"
begin
twitter.retweet(tweet.id)
rescue Twitter::Error::Forbidden
log "Already retweeted: #{tweet.user.screen_name}: #{tweet.text}"
end
end
# Follow a user
# @param user [String] username or user id
def follow(user, *args)
log "Following #{user}"
twitter.follow(user, *args)
end
# Unfollow a user
# @param user [String] username or user id
def unfollow(user, *args)
log "Unfollowing #{user}"
twitter.unfollow(user, *args)
end
# Tweet something
# @param text [String]
def tweet(text, *args)
log "Tweeting '#{text}'"
twitter.update(text, *args)
end
# Get a scheduler for this bot
# @return [Rufus::Scheduler]
def scheduler
@scheduler ||= Rufus::Scheduler.new
end
# Tweet some text with an image
# @param txt [String]
# @param pic [String] filename
def pictweet(txt, pic, *args)
log "Tweeting #{txt.inspect} - #{pic} #{args}"
twitter.update_with_media(txt, File.new(pic), *args)
end
end
end

lib/twitter_ebooks/model.rb (new file, 299 lines)
@@ -0,0 +1,299 @@
#!/usr/bin/env ruby
# encoding: utf-8
require 'json'
require 'set'
require 'digest/md5'
require 'csv'
module Ebooks
class Model
# @return [Array<String>]
# An array of unique tokens. This is the main source of actual strings
# in the model. Manipulation of a token is done using its index
# in this array, which we call a "tiki"
attr_accessor :tokens
# @return [Array<Array<Integer>>]
# Sentences represented by arrays of tikis
attr_accessor :sentences
# @return [Array<Array<Integer>>]
# Sentences derived from Twitter mentions
attr_accessor :mentions
# @return [Array<String>]
# The top 200 most important keywords, in descending order
attr_accessor :keywords
# Generate a new model from a corpus file
# @param path [String]
# @return [Ebooks::Model]
def self.consume(path)
Model.new.consume(path)
end
# Generate a new model from multiple corpus files
# @param paths [Array<String>]
# @return [Ebooks::Model]
def self.consume_all(paths)
Model.new.consume_all(paths)
end
# Load a saved model
# @param path [String]
# @return [Ebooks::Model]
def self.load(path)
model = Model.new
model.instance_eval do
props = Marshal.load(File.open(path, 'rb') { |f| f.read })
@tokens = props[:tokens]
@sentences = props[:sentences]
@mentions = props[:mentions]
@keywords = props[:keywords]
end
model
end
# Save model to a file
# @param path [String]
def save(path)
File.open(path, 'wb') do |f|
f.write(Marshal.dump({
tokens: @tokens,
sentences: @sentences,
mentions: @mentions,
keywords: @keywords
}))
end
self
end
def initialize
@tokens = []
# Reverse lookup tiki by token, for faster generation
@tikis = {}
end
# Reverse lookup a token index from a token
# @param token [String]
# @return [Integer]
def tikify(token)
@tikis[token] or (@tokens << token and @tikis[token] = @tokens.length-1)
end
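Sketching `tikify` standalone (same logic, mirrored into a lambda) shows how a token's first appearance appends it to the token list while later appearances reuse the stored index:

```ruby
tokens = []
tikis  = {}
# Mirror of Model#tikify: return the cached index, or append and index the token
tikify = lambda do |token|
  tikis[token] || (tokens << token; tikis[token] = tokens.length - 1)
end

tikify.call("hello")  # => 0 (appended)
tikify.call("world")  # => 1
tikify.call("hello")  # => 0 (looked up, not re-added)
```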
# Convert a body of text into arrays of tikis
# @param text [String]
# @return [Array<Array<Integer>>]
def mass_tikify(text)
sentences = NLP.sentences(text)
sentences.map do |s|
tokens = NLP.tokenize(s).reject do |t|
# Don't include usernames/urls as tokens
t.include?('@') || t.include?('http')
end
tokens.map { |t| tikify(t) }
end
end
# Consume a corpus into this model
# @param path [String]
def consume(path)
content = File.read(path, :encoding => 'utf-8')
if path.split('.')[-1] == "json"
log "Reading json corpus from #{path}"
lines = JSON.parse(content).map do |tweet|
tweet['text']
end
elsif path.split('.')[-1] == "csv"
log "Reading CSV corpus from #{path}"
content = CSV.parse(content)
header = content.shift
text_col = header.index('text')
lines = content.map do |tweet|
tweet[text_col]
end
else
log "Reading plaintext corpus from #{path} (if this is a json or csv file, please rename it with the appropriate extension and reconsume)"
lines = content.split("\n")
end
consume_lines(lines)
end
# Consume a sequence of lines
# @param lines [Array<String>]
def consume_lines(lines)
log "Removing commented lines and sorting mentions"
statements = []
mentions = []
lines.each do |l|
next if l.start_with?('#') # Remove commented lines
next if l.include?('RT') || l.include?('MT') # Remove soft retweets
if l.include?('@')
mentions << NLP.normalize(l)
else
statements << NLP.normalize(l)
end
end
text = statements.join("\n")
mention_text = mentions.join("\n")
lines = nil; statements = nil; mentions = nil # Allow garbage collection
log "Tokenizing #{text.count("\n")} statements and #{mention_text.count("\n")} mentions"
@sentences = mass_tikify(text)
@mentions = mass_tikify(mention_text)
log "Ranking keywords"
@keywords = NLP.keywords(text).top(200).map(&:to_s)
self
end
# Consume multiple corpuses into this model
# @param paths [Array<String>]
def consume_all(paths)
lines = []
paths.each do |path|
content = File.read(path, :encoding => 'utf-8')
if path.split('.')[-1] == "json"
log "Reading json corpus from #{path}"
l = JSON.parse(content).map do |tweet|
tweet['text']
end
lines.concat(l)
elsif path.split('.')[-1] == "csv"
log "Reading CSV corpus from #{path}"
content = CSV.parse(content)
header = content.shift
text_col = header.index('text')
l = content.map do |tweet|
tweet[text_col]
end
lines.concat(l)
else
log "Reading plaintext corpus from #{path}"
l = content.split("\n")
lines.concat(l)
end
end
consume_lines(lines)
end
# Correct encoding issues in generated text
# @param text [String]
# @return [String]
def fix(text)
NLP.htmlentities.decode text
end
# Check if an array of tikis comprises a valid tweet
# @param tikis [Array<Integer>]
# @param limit [Integer] how many chars we have left
def valid_tweet?(tikis, limit)
tweet = NLP.reconstruct(tikis, @tokens)
tweet.length <= limit && !NLP.unmatched_enclosers?(tweet)
end
# Generate some text
# @param limit [Integer] available characters
# @param generator [SuffixGenerator, nil]
# @param retry_limit [Integer] how many times to retry on invalid tweet
# @return [String]
def make_statement(limit=140, generator=nil, retry_limit=10)
responding = !generator.nil?
generator ||= SuffixGenerator.build(@sentences)
retries = 0
tweet = ""
while (tikis = generator.generate(3, :bigrams)) do
next if tikis.length <= 3 && !responding
break if valid_tweet?(tikis, limit)
retries += 1
break if retries >= retry_limit
end
if verbatim?(tikis) && tikis.length > 3 # We made a verbatim tweet by accident
while (tikis = generator.generate(3, :unigrams)) do
break if valid_tweet?(tikis, limit) && !verbatim?(tikis)
retries += 1
break if retries >= retry_limit
end
end
tweet = NLP.reconstruct(tikis, @tokens)
if retries >= retry_limit
log "Unable to produce valid non-verbatim tweet; using \"#{tweet}\""
end
fix tweet
end
# Test if a sentence has been copied verbatim from original
# @param tikis [Array<Integer>]
# @return [Boolean]
def verbatim?(tikis)
@sentences.include?(tikis) || @mentions.include?(tikis)
end
# Finds tokenized sentences that are relevant or slightly relevant to the
# input by comparing non-stopword token overlap
# @param sentences [Array<Array<Integer>>]
# @param input [String]
# @return [Array<Array<Array<Integer>>, Array<Array<Integer>>>]
def find_relevant(sentences, input)
relevant = []
slightly_relevant = []
tokenized = NLP.tokenize(input).map(&:downcase)
sentences.each do |sent|
tokenized.each do |token|
if sent.map { |tiki| @tokens[tiki].downcase }.include?(token)
relevant << sent unless NLP.stopword?(token)
slightly_relevant << sent
end
end
end
[relevant, slightly_relevant]
end
# Generates a response by looking for related sentences
# in the corpus and building a smaller generator from these
# @param input [String]
# @param limit [Integer] characters available for response
# @param sentences [Array<Array<Integer>>]
# @return [String]
def make_response(input, limit=140, sentences=@mentions)
# Prefer mentions
relevant, slightly_relevant = find_relevant(sentences, input)
if relevant.length >= 3
generator = SuffixGenerator.build(relevant)
make_statement(limit, generator)
elsif slightly_relevant.length >= 5
generator = SuffixGenerator.build(slightly_relevant)
make_statement(limit, generator)
elsif sentences.equal?(@mentions)
make_response(input, limit, @sentences)
else
make_statement(limit)
end
end
end
end

lib/twitter_ebooks/nlp.rb (new file, 195 lines)
@@ -0,0 +1,195 @@
# encoding: utf-8
require 'fast-stemmer'
require 'highscore'
module Ebooks
module NLP
# We deliberately limit our punctuation handling to stuff we can do consistently
# It'll just be a part of another token if we don't split it out, and that's fine
PUNCTUATION = ".?!,"
# Lazy-load NLP libraries and resources
# Some of this stuff is pretty heavy and we don't necessarily need
# to be using it all of the time
# Lazily loads an array of stopwords
# Stopwords are common English words that should often be ignored
# @return [Array<String>]
def self.stopwords
@stopwords ||= File.read(File.join(DATA_PATH, 'stopwords.txt')).split
end
# Lazily loads an array of known English nouns
# @return [Array<String>]
def self.nouns
@nouns ||= File.read(File.join(DATA_PATH, 'nouns.txt')).split
end
# Lazily loads an array of known English adjectives
# @return [Array<String>]
def self.adjectives
@adjectives ||= File.read(File.join(DATA_PATH, 'adjectives.txt')).split
end
# Lazily load part-of-speech tagging library
# This can determine whether a word is being used as a noun/adjective/verb
# @return [EngTagger]
def self.tagger
require 'engtagger'
@tagger ||= EngTagger.new
end
# Lazily load HTML entity decoder
# @return [HTMLEntities]
def self.htmlentities
require 'htmlentities'
@htmlentities ||= HTMLEntities.new
end
### Utility functions
# Normalize some strange unicode punctuation variants
# @param text [String]
# @return [String]
def self.normalize(text)
htmlentities.decode text.gsub('“', '"').gsub('”', '"').gsub('’', "'").gsub('…', '...')
end
# Split text into sentences
# We use an ad hoc approach because fancy libraries do not deal
# especially well with tweet formatting, and we can fake solving
# the quote problem during generation
# @param text [String]
# @return [Array<String>]
def self.sentences(text)
text.split(/\n+|(?<=[.?!])\s+/)
end
# Split a sentence into word-level tokens
# As above, this is ad hoc because tokenization libraries
# do not behave well wrt. things like emoticons and timestamps
# @param sentence [String]
# @return [Array<String>]
def self.tokenize(sentence)
regex = /\s+|(?<=[#{PUNCTUATION}]\s)(?=[a-zA-Z])|(?<=[a-zA-Z])(?=[#{PUNCTUATION}]+\s)/
sentence.split(regex)
end
# Get the 'stem' form of a word e.g. 'cats' -> 'cat'
# @param word [String]
# @return [String]
def self.stem(word)
Stemmer::stem_word(word.downcase)
end
# Use highscore gem to find interesting keywords in a corpus
# @param text [String]
# @return [Highscore::Keywords]
def self.keywords(text)
# Preprocess to remove stopwords (highscore's blacklist is v. slow)
text = NLP.tokenize(text).reject { |t| stopword?(t) }.join(' ')
text = Highscore::Content.new(text)
text.configure do
#set :multiplier, 2
#set :upper_case, 3
#set :long_words, 2
#set :long_words_threshold, 15
#set :vowels, 1 # => default: 0 = not considered
#set :consonants, 5 # => default: 0 = not considered
#set :ignore_case, true # => default: false
set :word_pattern, /(?<!@)(?<=\s)[\w']+/ # => default: /\w+/
#set :stemming, true # => default: false
end
text.keywords
end
# Builds a proper sentence from a list of tikis
# @param tikis [Array<Integer>]
# @param tokens [Array<String>]
# @return [String]
def self.reconstruct(tikis, tokens)
text = ""
last_token = nil
tikis.each do |tiki|
next if tiki == INTERIM
token = tokens[tiki]
text += ' ' if last_token && space_between?(last_token, token)
text += token
last_token = token
end
text
end
# Determine if we need to insert a space between two tokens
# @param token1 [String]
# @param token2 [String]
# @return [Boolean]
def self.space_between?(token1, token2)
p1 = self.punctuation?(token1)
p2 = self.punctuation?(token2)
if p1 && p2 # "foo?!"
false
elsif !p1 && p2 # "foo."
false
elsif p1 && !p2 # "foo. rah"
true
else # "foo rah"
true
end
end
# Is this token composed entirely of punctuation?
# @param token [String]
# @return [Boolean]
def self.punctuation?(token)
(token.chars.to_set - PUNCTUATION.chars.to_set).empty?
end
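Together, `reconstruct` and `space_between?` rebuild readable text: the four cases above collapse to "insert a space before every token except punctuation". A self-contained mirror of that rule over a hypothetical token list (no tiki indirection):

```ruby
require 'set'

punct_set = ".?!,".chars.to_set
punctuation = ->(token) { (token.chars.to_set - punct_set).empty? }

tokens = ["hello", ",", "world", "!"]
text = ""
last = nil
tokens.each do |token|
  # space_between? reduces to: space unless the next token is punctuation
  text += ' ' if last && !punctuation.call(token)
  text += token
  last = token
end

text  # => "hello, world!"
```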
# Is this token a stopword?
# @param token [String]
# @return [Boolean]
def self.stopword?(token)
@stopword_set ||= stopwords.map(&:downcase).to_set
@stopword_set.include?(token.downcase)
end
# Determine if a sample of text contains unmatched brackets or quotes
# This is one of the more frequent and noticeable failure modes for
# the generator; we can just tell it to retry
# @param text [String]
# @return [Boolean]
def self.unmatched_enclosers?(text)
enclosers = ['**', '""', '()', '[]', '``', "''"]
enclosers.each do |pair|
starter = Regexp.new('(\W|^)' + Regexp.escape(pair[0]) + '\S')
ender = Regexp.new('\S' + Regexp.escape(pair[1]) + '(\W|$)')
opened = 0
tokenize(text).each do |token|
opened += 1 if token.match(starter)
opened -= 1 if token.match(ender)
return true if opened < 0 # Too many ends!
end
return true if opened != 0 # Mismatch somewhere.
end
false
end
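A trimmed standalone version of the check (whitespace split instead of the full tokenizer) shows the two failure modes it flags: a closer with no opener, and an opener that is never closed:

```ruby
# Simplified mirror of NLP.unmatched_enclosers? (splits on whitespace only)
def unmatched_enclosers?(text)
  enclosers = ['**', '""', '()', '[]', '``', "''"]
  enclosers.each do |pair|
    starter = Regexp.new('(\W|^)' + Regexp.escape(pair[0]) + '\S')
    ender   = Regexp.new('\S' + Regexp.escape(pair[1]) + '(\W|$)')
    opened = 0
    text.split(/\s+/).each do |token|
      opened += 1 if token.match(starter)
      opened -= 1 if token.match(ender)
      return true if opened < 0  # a closer appeared before any opener
    end
    return true if opened != 0   # an opener was never closed
  end
  false
end

unmatched_enclosers?('he said "hi there"')       # => false
unmatched_enclosers?('(an aside that never ends') # => true
```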
# Determine if a2 is a subsequence of a1
# @param a1 [Array]
# @param a2 [Array]
# @return [Boolean]
def self.subseq?(a1, a2)
!a1.each_index.find do |i|
a1[i...i+a2.length] == a2
end.nil?
end
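Note that `subseq?` is a contiguous-window check rather than a general subsequence test: it slides a window of `a2`'s length across `a1` looking for an exact slice match. Mirrored into a lambda:

```ruby
# Same logic as NLP.subseq?: any index where a slice of a1 equals a2
subseq = ->(a1, a2) { !a1.each_index.find { |i| a1[i...i + a2.length] == a2 }.nil? }

subseq.call([5, 1, 2, 9], [1, 2])  # => true  (contiguous slice)
subseq.call([5, 1, 2, 9], [2, 1])  # => false (order matters)
```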
end
end

@@ -0,0 +1,95 @@
# encoding: utf-8
module Ebooks
# This generator uses data identical to a markov model, but
# instead of making a chain by looking up bigrams it uses the
# positions to randomly replace suffixes in one sentence with
# matching suffixes in another
class SuffixGenerator
# Build a generator from a corpus of tikified sentences
# @param sentences [Array<Array<Integer>>]
# @return [SuffixGenerator]
def self.build(sentences)
SuffixGenerator.new(sentences)
end
def initialize(sentences)
@sentences = sentences.reject { |s| s.length < 2 }
@unigrams = {}
@bigrams = {}
@sentences.each_with_index do |tikis, i|
last_tiki = INTERIM
tikis.each_with_index do |tiki, j|
@unigrams[last_tiki] ||= []
@unigrams[last_tiki] << [i, j]
@bigrams[last_tiki] ||= {}
@bigrams[last_tiki][tiki] ||= []
if j == tikis.length-1 # Mark sentence endings
@unigrams[tiki] ||= []
@unigrams[tiki] << [i, INTERIM]
@bigrams[last_tiki][tiki] << [i, INTERIM]
else
@bigrams[last_tiki][tiki] << [i, j+1]
end
last_tiki = tiki
end
end
self
end
# Generate a recombined sequence of tikis
# @param passes [Integer] number of times to recombine
# @param n [Symbol] :unigrams or :bigrams (affects how conservative the model is)
# @return [Array<Integer>]
def generate(passes=5, n=:unigrams)
index = rand(@sentences.length)
tikis = @sentences[index]
used = [index] # Sentences we've already used
verbatim = [tikis] # Verbatim sentences to avoid reproducing
0.upto(passes-1) do
varsites = {} # Map bigram start site => next tiki alternatives
tikis.each_with_index do |tiki, i|
next_tiki = tikis[i+1]
break if next_tiki.nil?
alternatives = (n == :unigrams) ? @unigrams[next_tiki] : @bigrams[tiki][next_tiki]
# Filter out suffixes from previous sentences
alternatives.reject! { |a| a[1] == INTERIM || used.include?(a[0]) }
varsites[i] = alternatives unless alternatives.empty?
end
variant = nil
varsites.to_a.shuffle.each do |site|
start = site[0]
site[1].shuffle.each do |alt|
verbatim << @sentences[alt[0]]
suffix = @sentences[alt[0]][alt[1]..-1]
potential = tikis[0..start+1] + suffix
# Ensure we're not just rebuilding some segment of another sentence
unless verbatim.find { |v| NLP.subseq?(v, potential) || NLP.subseq?(potential, v) }
used << alt[0]
variant = potential
break
end
end
break if variant
end
tikis = variant if variant
end
tikis
end
end
end

@@ -0,0 +1,3 @@
module Ebooks
VERSION = "3.1.0"
end

skeleton/Gemfile (new file, 4 lines)
@@ -0,0 +1,4 @@
source 'https://rubygems.org'
ruby '{{RUBY_VERSION}}'
gem 'twitter_ebooks'

skeleton/Procfile (new file, 1 line)
@@ -0,0 +1 @@
worker: bundle exec ebooks start

skeleton/bots.rb (new file, 60 lines)
@@ -0,0 +1,60 @@
require 'twitter_ebooks'
# This is an example bot definition with event handlers commented out
# You can define and instantiate as many bots as you like
class MyBot < Ebooks::Bot
# Configuration here applies to all MyBots
def configure
# Consumer details come from registering an app at https://dev.twitter.com/
# Once you have consumer details, use "ebooks auth" for new access tokens
self.consumer_key = '' # Your app consumer key
self.consumer_secret = '' # Your app consumer secret
# Users to block instead of interacting with
self.blacklist = ['tnietzschequote']
# Range in seconds to randomize delay when bot.delay is called
self.delay_range = 1..6
end
def on_startup
scheduler.every '24h' do
# Tweet something every 24 hours
# See https://github.com/jmettraux/rufus-scheduler
# tweet("hi")
# pictweet("hi", "cuteselfie.jpg")
end
end
def on_message(dm)
# Reply to a DM
# reply(dm, "secret secrets")
end
def on_follow(user)
# Follow a user back
# follow(user.screen_name)
end
def on_mention(tweet)
# Reply to a mention
# reply(tweet, "oh hullo")
end
def on_timeline(tweet)
# Reply to a tweet in the bot's timeline
# reply(tweet, "nice tweet")
end
def on_favorite(user, tweet)
# Follow user who just favorited bot's tweet
# follow(user.screen_name)
end
end
# Make a MyBot and attach it to an account
MyBot.new("{{BOT_NAME}}") do |bot|
bot.access_token = "" # Token connecting the app to this account
bot.access_token_secret = "" # Secret connecting the app to this account
end

skeleton/corpus/.gitignore (new empty file, vendored)
skeleton/gitignore (new file, 1 line)
@@ -0,0 +1 @@
corpus/

skeleton/model/.gitignore (new empty file, vendored)
spec/bot_spec.rb (new file, 216 lines)
@@ -0,0 +1,216 @@
require 'spec_helper'
require 'memory_profiler'
require 'tempfile'
require 'timecop'
class TestBot < Ebooks::Bot
attr_accessor :twitter
def configure
end
def on_message(dm)
reply dm, "echo: #{dm.text}"
end
def on_mention(tweet)
reply tweet, "echo: #{meta(tweet).mentionless}"
end
def on_timeline(tweet)
reply tweet, "fine tweet good sir"
end
end
module Ebooks::Test
# Generates a random twitter id
# Or a non-random one, given a string.
def twitter_id(seed = nil)
if seed.nil?
(rand*10**18).to_i
else
id = 1
seed.downcase.each_byte do |byte|
id *= byte/10
end
id
end
end
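The seeded branch above derives a stable id by multiplying integer-divided byte values, so the same username (in any case) always maps to the same mock id. The helper reproduced standalone:

```ruby
# Mirror of the spec helper: random id without a seed, deterministic with one
def twitter_id(seed = nil)
  if seed.nil?
    (rand * 10**18).to_i
  else
    id = 1
    seed.downcase.each_byte { |byte| id *= byte / 10 }
    id
  end
end

twitter_id("m1sp")  # => 4840 (bytes 109,49,115,112 -> 10*4*11*11)
twitter_id("M1SP")  # => 4840 (downcased first, so case-insensitive)
```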
# Creates a mock direct message
# @param username User sending the DM
# @param text DM content
def mock_dm(username, text)
Twitter::DirectMessage.new(id: twitter_id,
sender: { id: twitter_id(username), screen_name: username},
text: text)
end
# Creates a mock tweet
# @param username User sending the tweet
# @param text Tweet content
def mock_tweet(username, text, extra={})
mentions = text.split.find_all { |x| x.start_with?('@') }
tweet = Twitter::Tweet.new({
id: twitter_id,
in_reply_to_status_id: 'mock-link',
user: { id: twitter_id(username), screen_name: username },
text: text,
created_at: Time.now.to_s,
entities: {
user_mentions: mentions.map { |m|
{ screen_name: m.split('@')[1],
indices: [text.index(m), text.index(m)+m.length] }
}
}
}.merge!(extra))
tweet
end
# Creates a mock user
def mock_user(username)
Twitter::User.new(id: twitter_id(username), screen_name: username)
end
def twitter_spy(bot)
twitter = spy("twitter")
allow(twitter).to receive(:update).and_return(mock_tweet(bot.username, "test tweet"))
allow(twitter).to receive(:user).with(no_args).and_return(mock_user(bot.username))
twitter
end
def simulate(bot, &b)
bot.twitter = twitter_spy(bot)
bot.update_myself # Usually called in prepare
b.call
end
def expect_direct_message(bot, content)
expect(bot.twitter).to have_received(:create_direct_message).with(anything(), content, {})
bot.twitter = twitter_spy(bot)
end
def expect_tweet(bot, content)
expect(bot.twitter).to have_received(:update).with(content, anything())
bot.twitter = twitter_spy(bot)
end
end
describe Ebooks::Bot do
include Ebooks::Test
let(:bot) { TestBot.new('Test_Ebooks') }
before { Timecop.freeze }
after { Timecop.return }
it "responds to dms" do
simulate(bot) do
bot.receive_event(mock_dm("m1sp", "this is a dm"))
expect_direct_message(bot, "echo: this is a dm")
end
end
it "ignores its own dms" do
simulate(bot) do
expect(bot).to_not receive(:on_message)
bot.receive_event(mock_dm("Test_Ebooks", "why am I talking to myself"))
end
end
it "responds to mentions" do
simulate(bot) do
bot.receive_event(mock_tweet("m1sp", "@test_ebooks this is a mention"))
expect_tweet(bot, "@m1sp echo: this is a mention")
end
end
it "ignores its own mentions" do
simulate(bot) do
expect(bot).to_not receive(:on_mention)
expect(bot).to_not receive(:on_timeline)
bot.receive_event(mock_tweet("Test_Ebooks", "@m1sp i think that @test_ebooks is best bot"))
end
end
it "responds to timeline tweets" do
simulate(bot) do
bot.receive_event(mock_tweet("m1sp", "some excellent tweet"))
expect_tweet(bot, "@m1sp fine tweet good sir")
end
end
it "ignores its own timeline tweets" do
simulate(bot) do
expect(bot).to_not receive(:on_timeline)
bot.receive_event(mock_tweet("Test_Ebooks", "pudding is cute"))
end
end
it "links tweets to conversations correctly" do
tweet1 = mock_tweet("m1sp", "tweet 1", id: 1, in_reply_to_status_id: nil)
tweet2 = mock_tweet("m1sp", "tweet 2", id: 2, in_reply_to_status_id: 1)
tweet3 = mock_tweet("m1sp", "tweet 3", id: 3, in_reply_to_status_id: nil)
bot.conversation(tweet1).add(tweet1)
expect(bot.conversation(tweet2)).to eq(bot.conversation(tweet1))
bot.conversation(tweet2).add(tweet2)
expect(bot.conversation(tweet3)).to_not eq(bot.conversation(tweet2))
end
it "stops mentioning people after a certain limit" do
simulate(bot) do
bot.receive_event(mock_tweet("spammer", "@test_ebooks @m1sp 1"))
expect_tweet(bot, "@spammer @m1sp echo: 1")
Timecop.travel(Time.now + 60)
bot.receive_event(mock_tweet("spammer", "@test_ebooks @m1sp 2"))
expect_tweet(bot, "@spammer @m1sp echo: 2")
Timecop.travel(Time.now + 60)
bot.receive_event(mock_tweet("spammer", "@test_ebooks @m1sp 3"))
expect_tweet(bot, "@spammer echo: 3")
end
end
it "doesn't stop mentioning them if they reply" do
simulate(bot) do
bot.receive_event(mock_tweet("spammer", "@test_ebooks @m1sp 4"))
expect_tweet(bot, "@spammer @m1sp echo: 4")
Timecop.travel(Time.now + 60)
bot.receive_event(mock_tweet("m1sp", "@spammer @test_ebooks 5"))
expect_tweet(bot, "@m1sp @spammer echo: 5")
Timecop.travel(Time.now + 60)
bot.receive_event(mock_tweet("spammer", "@test_ebooks @m1sp 6"))
expect_tweet(bot, "@spammer @m1sp echo: 6")
end
end
it "doesn't get into infinite bot conversations" do
simulate(bot) do
bot.receive_event(mock_tweet("spammer", "@test_ebooks @m1sp 7"))
expect_tweet(bot, "@spammer @m1sp echo: 7")
Timecop.travel(Time.now + 2)
bot.receive_event(mock_tweet("spammer", "@test_ebooks @m1sp 8"))
expect_tweet(bot, "@spammer @m1sp echo: 8")
Timecop.travel(Time.now + 2)
bot.receive_event(mock_tweet("spammer", "@test_ebooks @m1sp 9"))
expect(bot.twitter).to_not have_received(:update)
end
end
it "blocks blacklisted users on contact" do
simulate(bot) do
bot.blacklist = ["spammer"]
bot.receive_event(mock_tweet("spammer", "@test_ebooks @m1sp 7"))
expect(bot.twitter).to have_received(:block).with("spammer")
end
end
end

spec/data/0xabad1dea.json (new file, 203945 lines; diff suppressed, file too large)

spec/data/0xabad1dea.model (new file, 6157 lines; diff suppressed, file too large)

spec/memprof.rb (new file, 37 lines)
@@ -0,0 +1,37 @@
require 'objspace'
module MemoryUsage
MemoryReport = Struct.new(:total_memsize)
def self.full_gc
GC.start(full_mark: true)
end
def self.report(&block)
rvalue_size = GC::INTERNAL_CONSTANTS[:RVALUE_SIZE]
full_gc
GC.disable
total_memsize = 0
generation = nil
ObjectSpace.trace_object_allocations do
generation = GC.count
block.call
end
ObjectSpace.each_object do |obj|
next unless generation == ObjectSpace.allocation_generation(obj)
memsize = ObjectSpace.memsize_of(obj) + rvalue_size
# compensate for API bug
memsize = rvalue_size if memsize > 100_000_000_000
total_memsize += memsize
end
GC.enable
full_gc
return MemoryReport.new(total_memsize)
end
end

spec/model_spec.rb (new file, 74 lines)
@@ -0,0 +1,74 @@
require 'spec_helper'
require 'memory_profiler'
require 'tempfile'
def Process.rss; `ps -o rss= -p #{Process.pid}`.chomp.to_i; end
describe Ebooks::Model do
describe 'making tweets' do
before(:all) { @model = Ebooks::Model.consume(path("data/0xabad1dea.json")) }
it "generates a tweet" do
s = @model.make_statement
expect(s.length).to be <= 140
puts s
end
it "generates an appropriate response" do
s = @model.make_response("hi")
expect(s.length).to be <= 140
expect(s.downcase).to include("hi")
puts s
end
end
it "consumes, saves and loads models correctly" do
model = nil
report = MemoryUsage.report do
model = Ebooks::Model.consume(path("data/0xabad1dea.json"))
end
expect(report.total_memsize).to be < 200000000
file = Tempfile.new("0xabad1dea")
model.save(file.path)
report2 = MemoryUsage.report do
model = Ebooks::Model.load(file.path)
end
expect(report2.total_memsize).to be < 3000000
expect(model.tokens[0]).to be_a String
expect(model.sentences[0][0]).to be_a Fixnum
expect(model.mentions[0][0]).to be_a Fixnum
expect(model.keywords[0]).to be_a String
puts "0xabad1dea.model uses #{report2.total_memsize} bytes in memory"
end
describe '.consume' do
it 'interprets lines with @ as mentions' do
file = Tempfile.new('mentions')
file.write('@m1spy hello!')
file.close
model = Ebooks::Model.consume(file.path)
expect(model.sentences.count).to eq 0
expect(model.mentions.count).to eq 1
file.unlink
end
it 'interprets lines without @ as statements' do
file = Tempfile.new('statements')
file.write('hello!')
file.close
model = Ebooks::Model.consume(file.path)
expect(model.mentions.count).to eq 0
expect(model.sentences.count).to eq 1
file.unlink
end
end
end

spec/spec_helper.rb (new file, 6 lines)
@@ -0,0 +1,6 @@
require 'twitter_ebooks'
require_relative 'memprof'
def path(relpath)
File.join(File.dirname(__FILE__), relpath)
end

twitter_ebooks.gemspec (new file, 34 lines)
@@ -0,0 +1,34 @@
# -*- encoding: utf-8 -*-
require File.expand_path('../lib/twitter_ebooks/version', __FILE__)
Gem::Specification.new do |gem|
gem.authors = ["Jaiden Mispy"]
gem.email = ["^_^@mispy.me"]
gem.description = %q{Markov chains for all your friends~}
gem.summary = %q{Markov chains for all your friends~}
gem.homepage = ""
gem.files = `git ls-files`.split($\)
gem.executables = gem.files.grep(%r{^bin/}).map{ |f| File.basename(f) }
gem.test_files = gem.files.grep(%r{^(test|spec|features)/})
gem.name = "twitter_ebooks"
gem.require_paths = ["lib"]
gem.version = Ebooks::VERSION
gem.add_development_dependency 'rspec'
gem.add_development_dependency 'rspec-mocks'
gem.add_development_dependency 'memory_profiler'
gem.add_development_dependency 'timecop'
gem.add_development_dependency 'pry-byebug'
gem.add_development_dependency 'yard'
gem.add_runtime_dependency 'twitter', '~> 5.0'
gem.add_runtime_dependency 'rufus-scheduler'
gem.add_runtime_dependency 'gingerice'
gem.add_runtime_dependency 'htmlentities'
gem.add_runtime_dependency 'engtagger'
gem.add_runtime_dependency 'fast-stemmer'
gem.add_runtime_dependency 'highscore'
gem.add_runtime_dependency 'pry'
gem.add_runtime_dependency 'oauth'
end