ruby - Ruby invalid byte sequence in UTF-8 (ArgumentError)

Question

Possible duplicate:
ruby 1.9: invalid byte sequence in UTF-8

I'm currently building a file system crawler and get the following error when I run my script:

wordcrawler.rb:8:in `block in <main>': invalid byte sequence in UTF-8 (ArgumentError)
    from /Users/Anconia/.rvm/rubies/ruby-1.9.3-p327/lib/ruby/1.9.1/find.rb:41:in `block in find'
    from /Users/Anconia/.rvm/rubies/ruby-1.9.3-p327/lib/ruby/1.9.1/find.rb:40:in `catch'
    from /Users/Anconia/.rvm/rubies/ruby-1.9.3-p327/lib/ruby/1.9.1/find.rb:40:in `find'
    from wordcrawler.rb:5:in `<main>'

Here is my code:

require 'find'

count = 0

Find.find('/Users/Anconia/') do |file|                   # '/' for root directory on OS X
  if file =~ /\b(\.txt|\.doc|\.docx)\b/                # check if filename ends in desired format
    contents = File.read(file)
    if contents =~ /regex/
      puts file
      count += 1
    end
  end
end

puts "#{count} files were found"

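In my development environment I use ruby 1.9.3; however, when I switch to ruby 1.8.7 the script runs fine. I'd like to keep using 1.9.3 if possible. I have tried all of the solutions in this post (ruby 1.9: invalid byte sequence in UTF-8), but my problem persists. Any suggestions?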
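
For reference, the failure does not depend on the file system walk. A minimal sketch, assuming a made-up sample string and an illustrative helper named `match_error` (neither comes from the original script): Ruby 1.9 and later tag strings with an encoding, and a regex match against a string whose bytes are invalid for that encoding raises ArgumentError. Ruby 1.8.7 treats strings as plain bytes, which would explain why the same script runs there.

```ruby
# Hypothetical sample text: the byte 0xFF can never appear in valid UTF-8.
broken = "caf\xFF regex".dup.force_encoding(Encoding::UTF_8)

# Returns the message of the ArgumentError raised by the match,
# or nil when the match runs without raising.
# `match_error` is an illustrative helper name, not a library method.
def match_error(text, pattern)
  text =~ pattern
  nil
rescue ArgumentError => e
  e.message
end

puts broken.valid_encoding?                       # false
puts match_error(broken, /regex/).inspect         # the encoding error message
puts match_error("cafe regex", /regex/).inspect   # nil
```

Checking `valid_encoding?` on the result of `File.read` is a cheap way to find out which files trigger the error.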

score 6 · Accepted Answer

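I had not properly understood the content of the post mentioned above. At the very least, this can serve as an implementation example of that post.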

require 'find'

count = 0

Find.find('/Users/Anconia/') do |file|                                              # '/' for root directory on OS X
  if file =~ /\b(\.txt|\.doc|\.docx)\b/                                           # check if filename ends in desired format
    contents = File.read(file).encode!('UTF-8', 'UTF-8', :invalid => :replace)    # resolves encoding errors - must use 1.9.3 else use iconv
    if contents =~ /regex/
      puts file
      count += 1
    end
  end
end

puts "#{count} files were found"
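
On Ruby 2.1 and later, `String#scrub` is another way to get the same effect. The sketch below is an alternative to the `encode!` call above, not what the answer used; `sanitize_utf8` is a hypothetical helper name, and the `'?'` replacement character is an arbitrary choice.

```ruby
# Sketch only: String#scrub requires Ruby 2.1 or newer.
# `sanitize_utf8` is a hypothetical helper, not part of any library.
def sanitize_utf8(raw, replacement = '?')
  # Copy the input, label the bytes as UTF-8, then swap every invalid
  # byte sequence for the replacement string.
  raw.dup.force_encoding(Encoding::UTF_8).scrub(replacement)
end

# Made-up sample bytes containing one invalid byte (0xFF).
clean = sanitize_utf8("caf\xFF regex".b)

puts clean                    # caf? regex
puts clean.valid_encoding?    # true
puts !!(clean =~ /regex/)     # true
```

In the crawler, this would mean reading the raw bytes with `File.binread(file)` and passing them through the helper before the regex match.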
