ruby - File.readlines invalid byte sequence in UTF-8 (ArgumentError)

Question

I'm processing a file that contains data from the web, and on some log files I get an "invalid byte sequence in UTF-8 (ArgumentError)" error.

a = File.readlines('log.csv').grep(/watch\?v=/).map do |s|
  s = s.parse_csv
  { timestamp: s[0], url: s[1], ip: s[3] }
end
puts a

I'm trying to get this solution to work. I've seen people doing

.encode!('UTF-8', 'UTF-8', :invalid => :replace)

but it doesn't appear to work with File.readlines.

File.readlines('log.csv').encode!('UTF-8', 'UTF-8', :invalid => :replace).grep(/watch\?v=/)

' : undefined method `encode!' for # (NoMethodError)

What's the most straightforward way to filter out or convert invalid UTF-8 characters while reading a file?

~~Attempt 1~~

Tried this, but it failed with the same invalid byte sequence error.

IO.foreach('test.csv', 'r:bom|UTF-8').grep(/watch\?v=/).map do |s|
  # extract three columns: time stamp, url, ip
  s = s.parse_csv
  { timestamp: s[0], url: s[1], ip: s[3] }
end

Solution

This seems to have worked for me.

a = File.readlines('log.csv', :encoding => 'ISO-8859-1').grep(/watch\?v=/).map do |s|
  s = s.parse_csv
  { timestamp: s[0], url: s[1], ip: s[3] }
end
puts a

Does Ruby provide a way to do File.read() with a specified encoding?
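
For what it's worth, File.read accepts the same encoding: option as File.readlines. Here is a minimal sketch, assuming the file really is ISO-8859-1; the helper name read_latin1_as_utf8 is made up for illustration. The 'external:internal' form transcodes to UTF-8 while reading, so later regex matching works on ordinary UTF-8 strings.

```ruby
# Hypothetical helper: read a file stored as ISO-8859-1 and get UTF-8 back.
# The "external:internal" encoding pair makes Ruby transcode during the read.
def read_latin1_as_utf8(path)
  File.read(path, encoding: 'ISO-8859-1:UTF-8')
end

# Usage, guarded so the snippet also runs when log.csv is absent.
if File.exist?('log.csv')
  text = read_latin1_as_utf8('log.csv')
  puts text.encoding # the internal encoding requested above
end
```

Every byte is a valid ISO-8859-1 character, which is probably why this read never raises. If the file is actually in some other encoding, the text will still load but non-ASCII characters may come out wrong.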

score 6 · Accepted Answer

I'm trying to get this solution to work. I've seen people doing
   .encode!('UTF-8', 'UTF-8', :invalid => :replace)
but it doesn't appear to work with File.readlines.

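File.readlines returns an Array. Arrays don't have an encode method. Strings, on the other hand, do have an encode method.

Could you please provide an example of the alternative above?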
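
Since the cleanup is a String operation, it has to be applied to each element of the array. A rough sketch of that idea follows. Note that it swaps in String#scrub (Ruby 2.1+) instead of the encode! call from the question, and read_clean_lines is a made-up helper name, not something from the original code.

```ruby
# Hypothetical helper: read all lines, then replace invalid UTF-8 bytes
# line by line. String#scrub is used here in place of encode!.
def read_clean_lines(path, replacement = '?')
  File.readlines(path, encoding: 'UTF-8').map { |line| line.scrub(replacement) }
end

# Usage, guarded so the snippet also runs when log.csv is absent.
if File.exist?('log.csv')
  puts read_clean_lines('log.csv').grep(/watch\?v=/)
end
```

After scrubbing, every line is valid UTF-8, so grep with a regex no longer raises ArgumentError.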

require 'csv'

CSV.foreach("log.csv", encoding: "utf-8") do |row|
  md = row[0].match(/watch\?v=/)
  puts row[0], row[1], row[3] if md
end

Or,

CSV.foreach("log.csv", 'rb:utf-8') do |row|

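If you need more speed, use the fastercsv gem (this only matters on Ruby 1.8; since Ruby 1.9 the standard csv library is FasterCSV).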
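
To tie the CSV approach back to the hashes built in the question, here is a hedged sketch. It assumes the file is ISO-8859-1 (as in the questioner's own fix) and that the URL is in the second column, as in the question's code. watch_rows is a hypothetical name.

```ruby
require 'csv'

# Hypothetical helper: parse the log with CSV, transcoding from the assumed
# source encoding to UTF-8, and keep only rows whose URL looks like a watch link.
def watch_rows(path, encoding = 'ISO-8859-1:UTF-8')
  rows = []
  CSV.foreach(path, encoding: encoding) do |row|
    next unless row[1].to_s.match?(/watch\?v=/)
    rows << { timestamp: row[0], url: row[1], ip: row[3] }
  end
  rows
end

# Usage, guarded so the snippet also runs when log.csv is absent.
puts watch_rows('log.csv') if File.exist?('log.csv')
```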

This seems to have worked for me.
File.readlines('log.csv', :encoding => 'ISO-8859-1')

Yes, in order to read a file, you have to know its encoding.
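
When the encoding is not known for certain, one possible workaround is to try UTF-8 first and fall back to a guess. This is only a sketch: read_with_fallback is a hypothetical helper, and ISO-8859-1 as the fallback is an assumption that suits some Western European logs but not everything.

```ruby
# Hypothetical helper: treat the file as UTF-8 if its bytes are valid UTF-8,
# otherwise reinterpret the same bytes in a fallback encoding and transcode.
def read_with_fallback(path, fallback = 'ISO-8859-1')
  text = File.read(path, encoding: 'UTF-8')
  return text if text.valid_encoding?

  # force_encoding only relabels the bytes; encode does the real conversion.
  text.force_encoding(fallback).encode('UTF-8')
end

# Usage, guarded so the snippet also runs when log.csv is absent.
puts read_with_fallback('log.csv').lines.grep(/watch\?v=/) if File.exist?('log.csv')
```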

score 0 · Accepted Answer

In my case the script defaults to US-ASCII, and I'm not at liberty to change that on the server for fear of causing other conflicts.

I did

File.readlines(email, :encoding => 'UTF-8').each do |line|

But this didn't work for some Japanese characters, so I added this on the next line, and it worked well.

line = line.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
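
One caveat worth knowing: converting from 'binary' with undef: :replace treats every byte above 0x7F as unconvertible, so this call removes all non-ASCII text, including valid Japanese characters, not just the broken bytes. The sketch below contrasts it with String#scrub (Ruby 2.1+), which drops only invalid sequences. Both helper names are made up for illustration.

```ruby
# Same call as above, but non-mutating: every non-ASCII byte is dropped.
def strip_non_ascii(line)
  line.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
end

# Alternative: keep valid multibyte characters, drop only invalid bytes.
def drop_invalid_only(line)
  line.scrub('')
end

# Sample line: valid Japanese text followed by one stray 0xFF byte.
sample = ("\u65E5\u672C\u8A9E ok".b + "\xFF\n".b).force_encoding('UTF-8')

p strip_non_ascii(sample)   # ASCII part only
p drop_invalid_only(sample) # Japanese text preserved
```

Which behaviour is right depends on whether the non-ASCII content matters downstream.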
