ruby - Ruby CSV BOM|UTF-8 encoding for StringIO

Question

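Ruby 2.6.3.

I've been trying to parse a StringIO object into a CSV instance with the bom|utf-8 encoding, so that the (unwanted) BOM character is stripped and the content is encoded as UTF-8: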

require 'csv'

CSV_READ_OPTIONS = { headers: true, encoding: 'bom|utf-8' }.freeze

content = StringIO.new("\xEF\xBB\xBFid\n123")
first_row = CSV.parse(content, CSV_READ_OPTIONS).first

first_row.headers.first.include?("\xEF\xBB\xBF")     # This returns true

Apparently the bom|utf-8 encoding doesn't work for StringIO objects, but I found that it does work for files, for example:

require 'csv'

CSV_READ_OPTIONS = { headers: true, encoding: 'bom|utf-8' }.freeze

# File content is: "\xEF\xBB\xBFid\n12"
first_row = CSV.read('bom_content.csv', CSV_READ_OPTIONS).first

first_row.headers.first.include?("\xEF\xBB\xBF")     # This returns false

Given that I need to use StringIO directly, why does CSV ignore the bom|utf-8 encoding? Is there a way to remove the BOM character from a StringIO instance?

Thanks!

score 3 · Accepted Answer

Ruby 2.7 added the set_encoding_by_bom method to IO. This method consumes the byte order mark and sets the encoding.

require 'csv'
require 'stringio'

CSV_READ_OPTIONS = { headers: true }.freeze

content = StringIO.new("\xEF\xBB\xBFid\n123")
content.set_encoding_by_bom

first_row = CSV.parse(content, **CSV_READ_OPTIONS).first
first_row.headers.first.include?("\xEF\xBB\xBF")
#=> false
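
To make the behaviour of set_encoding_by_bom a little more concrete, here is a minimal sketch that inspects its return value and the read position afterwards. It assumes Ruby 2.7 or newer; the variable names are only illustrative.

```ruby
require 'csv'
require 'stringio'

content = StringIO.new("\xEF\xBB\xBFid\n123")

# Returns the detected Encoding when a BOM is present, and nil otherwise.
detected = content.set_encoding_by_bom

# The BOM has been consumed, so reading continues right after it.
position_after_bom = content.pos

table = CSV.parse(content, headers: true)
first_header = table.headers.first

# Input without a BOM is left alone and the call returns nil.
no_bom_result = StringIO.new("id\n123").set_encoding_by_bom
```

Because the call returns nil when no BOM is found, it should be safe to call unconditionally on input that may or may not start with one.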

score 2 · Accepted Answer

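Ruby doesn't like BOMs. It only handles them when reading files, not anywhere else, and even then it only reads them so that it can get rid of them. If you want to handle a BOM in a string, or write one to a file, you have to do it manually.

There are probably gems that do this, though it's easy enough to do yourself: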

bytes = string.b                                   # binary copy, so the checks are byte-wise
if bytes.start_with?("\xEF\xBB\xBF".b)             # UTF-8 BOM
  string = bytes[3..-1].force_encoding('UTF-8')
elsif bytes.start_with?("\xFF\xFE".b)              # UTF-16LE BOM
  string = bytes[2..-1].force_encoding('UTF-16LE')
# etc.
end
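
The "etc." above could be filled in along the following lines. This is only a sketch: BOMS and strip_bom are hypothetical names, not part of Ruby or any gem, and it assumes that input starting with FF FE 00 00 is UTF-32LE rather than UTF-16LE text that begins with a NUL character.

```ruby
# Hypothetical lookup table: BOM bytes => encoding name.
# Longer BOMs come first, because the UTF-32LE BOM starts with the UTF-16LE one.
BOMS = {
  "\x00\x00\xFE\xFF".b => 'UTF-32BE',
  "\xFF\xFE\x00\x00".b => 'UTF-32LE',
  "\xEF\xBB\xBF".b     => 'UTF-8',
  "\xFE\xFF".b         => 'UTF-16BE',
  "\xFF\xFE".b         => 'UTF-16LE'
}.freeze

# Hypothetical helper: returns a copy of +string+ without its leading BOM,
# tagged with the encoding that the BOM indicates. A string without a BOM
# is returned as an unchanged copy.
def strip_bom(string)
  bytes = string.b # binary copy, so the prefix check is byte-wise
  BOMS.each do |bom, encoding|
    next unless bytes.start_with?(bom)
    return bytes.byteslice(bom.bytesize..-1).force_encoding(encoding)
  end
  string.dup
end

utf8  = strip_bom("\xEF\xBB\xBFid\n123")
utf16 = strip_bom("\xFF\xFEi\x00d\x00".b)
plain = strip_bom("id\n123")
```

The result can then be wrapped in a new StringIO, or passed straight to CSV.parse, which also accepts a plain string.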

score 2 · Accepted Answer

I found that forcing the encoding of the StringIO's string to UTF-8 and removing the BOM to build a new StringIO works:

require 'csv'
CSV_READ_OPTIONS = { headers: true}.freeze
content = StringIO.new("\xEF\xBB\xBFid\n123")
csv_file = StringIO.new(content.string.force_encoding('utf-8').sub("\xEF\xBB\xBF", ''))
first_row = CSV.parse(csv_file, CSV_READ_OPTIONS).first

first_row.headers.first.include?("\xEF\xBB\xBF") # => false

The encoding option is no longer needed. It may not be the best option memory-wise, but it does work.
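
If the extra copy is a concern, one possible alternative is to skip the BOM in place by moving the read position of the existing StringIO. This is a sketch under assumptions: skip_utf8_bom and UTF8_BOM are hypothetical names, and only the UTF-8 BOM is handled.

```ruby
require 'csv'
require 'stringio'

UTF8_BOM = "\xEF\xBB\xBF".b.freeze

# Hypothetical helper: advances +io+ past a leading UTF-8 BOM, or rewinds
# to the start when there is none. The underlying string is not copied.
def skip_utf8_bom(io)
  io.rewind
  first_bytes = io.read(3) # read with a length returns binary bytes, or nil at EOF
  io.rewind unless first_bytes == UTF8_BOM
  io
end

content = StringIO.new("\xEF\xBB\xBFid\n123")
first_row = CSV.parse(skip_utf8_bom(content), headers: true).first
```

This relies on CSV.parse reading from the StringIO's current position, which is the same behaviour the set_encoding_by_bom approach depends on.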
