MODEL: Read in utf-8, only parse CSV once
Ran into an `Encoding::CompatibilityError` while trying to consume my corpus (tweets.csv) on Windows 7; this likely affects other environments as well. Fix: force the corpus file contents to be read as UTF-8. Also a quick clean-up of the CSV flow so the content is parsed once instead of the file being read and parsed twice.
parent 4b88d3326b
commit be6ac9127f
1 changed file with 4 additions and 3 deletions
@@ -19,7 +19,7 @@ module Ebooks
     end
 
     def consume(path)
-      content = File.read(path)
+      content = File.read(path, :encoding => 'utf-8')
       @hash = Digest::MD5.hexdigest(content)
 
       if path.split('.')[-1] == "json"
@@ -29,9 +29,10 @@ module Ebooks
         end
       elsif path.split('.')[-1] == "csv"
         log "Reading CSV corpus from #{path}"
-        header = CSV.read(path).first
+        content = CSV.parse(content)
+        header = content.shift
         text_col = header.index('text')
-        lines = CSV.read(path).drop(1).map do |tweet|
+        lines = content.map do |tweet|
           tweet[text_col]
         end
       else
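
For anyone who wants to try the CSV branch outside of `consume`, here is a minimal standalone sketch of the same idea: read the file once as UTF-8, parse that string once, and pull out the `text` column. The helper name `read_text_column`, the `nil` check on the column index, and the sample file are assumptions made for illustration; none of them are part of this commit.

```ruby
require 'csv'
require 'tmpdir'

# Hypothetical helper (not in the commit): returns the values of the
# 'text' column from a CSV corpus, reading and parsing the file once.
def read_text_column(path)
  # Force UTF-8 so the result does not depend on the platform's
  # default external encoding.
  content = File.read(path, :encoding => 'utf-8')

  # Parse the already-loaded string instead of calling CSV.read twice.
  rows = CSV.parse(content)
  header = rows.shift
  text_col = header.index('text')
  # Assumption: fail loudly when the column is missing.
  raise ArgumentError, "no 'text' column in #{path}" if text_col.nil?

  rows.map { |row| row[text_col] }
end

# Usage with a throwaway sample file (made-up data).
Dir.mktmpdir do |dir|
  sample = File.join(dir, 'tweets.csv')
  File.write(sample, "tweet_id,text\n1,caf\u00e9 time\n2,hello\n",
             :encoding => 'utf-8')
  puts read_text_column(sample)
end
```

Parsing the string that was already read also means the MD5 hash and the parsed rows come from the same bytes, which the two separate `CSV.read(path)` calls did not guarantee.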