MODEL: Read in utf-8, only parse CSV once

Ran into an `Encoding::CompatibilityError` while consuming my corpus (tweets.csv) on Windows 7; this likely affects other environments as well.

Fix: force reading corpus file contents as utf-8.
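
A minimal sketch of the difference, assuming a small throwaway file in place of the real corpus (the temp file and its contents are made up for illustration). Without an `:encoding` option, `File.read` tags the string with `Encoding.default_external`, which varies by platform and locale; passing the option pins it to UTF-8.

```ruby
require 'tempfile'

# Hypothetical sample file standing in for a corpus such as tweets.csv.
file = Tempfile.new(['corpus', '.csv'])
file.binmode
file.write("text\ncaf\u00e9\n")
file.close

# No :encoding option: the result is tagged with Encoding.default_external,
# which depends on the environment.
default_read = File.read(file.path)

# Explicit :encoding: the bytes are tagged as UTF-8 regardless of environment.
utf8_read = File.read(file.path, :encoding => 'utf-8')

puts default_read.encoding     # environment-dependent
puts utf8_read.encoding        # UTF-8
puts utf8_read.valid_encoding? # true
```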

Also a quick clean-up of the CSV flow: parse the content once instead of reading and parsing the file twice.
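
Roughly, the single-parse flow looks like the sketch below. The in-memory `content` string and the `rows` variable name are illustrative assumptions; in the real code the string comes from `File.read`.

```ruby
require 'csv'

# Hypothetical in-memory corpus; the real content is read from disk.
content = "id,text\n1,hello world\n2,second tweet\n"

rows = CSV.parse(content)        # parse the string once into an array of rows
header = rows.shift              # remove and return the header row
text_col = header.index('text')  # column index of the tweet text
lines = rows.map { |tweet| tweet[text_col] }

p lines  # ["hello world", "second tweet"]
```

Because `shift` removes the header in place, the remaining rows can be mapped directly, with no second `CSV.read` and no `drop(1)`.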
Joel McCoy 2014-06-27 18:42:51 -04:00
parent 4b88d3326b
commit be6ac9127f

@@ -19,7 +19,7 @@ module Ebooks
end
def consume(path)
-content = File.read(path)
+content = File.read(path, :encoding => 'utf-8')
@hash = Digest::MD5.hexdigest(content)
if path.split('.')[-1] == "json"
@@ -29,9 +29,10 @@ module Ebooks
end
elsif path.split('.')[-1] == "csv"
log "Reading CSV corpus from #{path}"
-header = CSV.read(path).first
+content = CSV.parse(content)
+header = content.shift
text_col = header.index('text')
-lines = CSV.read(path).drop(1).map do |tweet|
+lines = content.map do |tweet|
tweet[text_col]
end
else