ruby - Replace specific records in a CSV file with records from another CSV file using Ruby

Question

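I have two large CSV files. One file is just a list of records. The other file is also a list of records, but its first column is the number of the record it modifies in the other file. It doesn't replace the entire row; it only replaces the values in the columns with matching headers.

For example:

File 1: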

"First","Last","Lang"
"John","Doe","Ruby"
"Jane","Doe","Perl"
"Dane","Joe","Lisp"

File 2:

"Seq","Lang"
2,"Ruby"

The goal is to end up with a file that looks like this:

"First","Last","Lang"
"John","Doe","Ruby"
"Jane","Doe","Ruby"
"Dane","Joe","Lisp"

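However, the data is much more complex than this and may even contain line breaks inside CSV fields. So I can't rely on line numbers and have to rely on the record count instead. (Unless, of course, I preprocess both files to replace the newlines and carriage returns... which I suppose is possible, but less fun.)

My question is how to loop through both files and make the correct replacements without loading either file entirely into memory. I believe loading 100 MB+ files into memory is a bad idea, right?

Also, the records in the resulting file should be in the same order once it's done.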
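
To make the point about embedded line breaks concrete, here is a small sketch (the sample data is made up for illustration) suggesting that Ruby's standard CSV library counts a quoted field containing a newline as part of a single record, so record numbers and physical line numbers can drift apart:

```ruby
require 'csv'

# Hypothetical sample: the first data record has a newline inside a quoted field.
data = %("First","Notes"\n"John","line one\nline two"\n"Jane","single line"\n)

# Physical lines in the text, counted naively.
physical_lines = data.lines.count

# Data records as seen by the CSV parser (the header row is not included).
records = CSV.parse(data, headers: true).map(&:to_h)

puts "physical lines: #{physical_lines}"
puts "records: #{records.size}"
```

Any approach that numbers rows through the CSV parser rather than through raw line reads should therefore stay aligned with the Seq column.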

score 1 · Accepted Answer

If the files are too large to load into memory, this is basically how I would handle it:

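// pseudocode

f1 = fopen(file1)
f2 = fopen(file2)
f3 = fopen(newfile)

// loop through exceptions (file2 must be sorted by seq)
foreach row2, index2 of f2

  // loop through file1 until the matched row has been written
  while (row1, index1 of f1) && (row1 not null)

    // patch
    if row2[seq] == index1
      row1[lang] = row2[lang]
    endif

    // write out to new file
    f3.write row1

    // stop before reading past the matched row
    if row2[seq] == index1
      break
    endif

  endwhile
endforeach

// copy the rows of file1 that follow the last exception
while (row1 of f1) && (row1 not null)
  f3.write row1
endwhile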

† Since your file2 has 1-based indexes (rather than 0-based), you will want to start index1 and index2 at 1.

†† If lang is not the column you will always be replacing:

// at the beginning of the foreach loop
if col is null
  cols = array_keys row2
  col = cols[2] // 1-based index
endif

// the new patch block
if row2[seq] == index1
  row1[col] = row2[col]
endif
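
The pseudocode above could be sketched in Ruby roughly as follows. This is only one possible reading, not a definitive implementation: the names `patch_csv` and `next_patch` are made up for illustration, the patch file is assumed to be sorted by its first column (a 1-based record number), every other patch column is assumed to exist as a header in the base file, and `force_quotes: true` is an assumption chosen to match the quoting in the sample files.

```ruby
require 'csv'

# Returns the next patch row, or nil once the patch file is exhausted.
def next_patch(enum)
  enum.next
rescue StopIteration
  nil
end

# Streams base_path row by row, applies the patches from patch_path and
# writes the result to out_path. Neither input is loaded into memory whole.
# Assumption: patch rows are sorted by their first column (1-based record number).
def patch_csv(base_path, patch_path, out_path)
  patches = CSV.foreach(patch_path, headers: true) # lazy enumerator over patch rows
  patch = next_patch(patches)
  columns = nil # header name => column position in the base file

  CSV.open(out_path, 'w', force_quotes: true) do |out|
    # The base file is read without headers, so its header row is index 0
    # and the first data record is index 1, matching the 1-based Seq values.
    CSV.foreach(base_path).with_index do |row, index|
      if index.zero?
        columns = row.each_with_index.to_h
      else
        # Apply every patch that targets this record.
        while patch && patch.fields.first.to_i == index
          # Skip the Seq column; copy the remaining values by header name.
          patch.to_h.drop(1).each { |name, value| row[columns.fetch(name)] = value }
          patch = next_patch(patches)
        end
      end
      out << row
    end
  end
end
```

With the sample files from the question, a call such as `patch_csv('file1.csv', 'file2.csv', 'output.csv')` should leave the records in their original order and change only the Lang value of the second record.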

score 1 · Accepted Answer

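You'll need 2 enumerators, but since they aren't nested, one of them has to be advanced manually with Enumerator#next, which means you need to be careful: it raises an exception (StopIteration) once it reaches EOF: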

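require 'csv'

# Enumerator over the patch rows of file2
e = CSV.open('file2.csv', :headers => true).each
seq = e.next

output = CSV.open('output.csv', 'w')

csv = CSV.open('file1.csv')
csv.each do |row|
  # file1 is read without :headers, so its header row is row 1; hence the - 1
  if seq['Seq'].to_i == csv.lineno - 1
    row[2] = seq['Lang']
    # once file2 is exhausted, fall back to a Seq that never matches
    seq = e.next rescue ({'Seq' => -1})
  end
  output << row
end

# close the files so the buffered output is flushed to disk
csv.close
output.close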
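
As a small illustration of the EOF behaviour mentioned above, the following sketch (using a made-up in-memory stand-in for file2.csv) shows Enumerator#next raising StopIteration once the rows run out. Rescuing that class explicitly is one alternative to the `rescue` modifier, which would also swallow any other StandardError:

```ruby
require 'csv'

# Hypothetical in-memory stand-in for file2.csv
patch_data = %("Seq","Lang"\n2,"Ruby"\n)
patches = CSV.new(patch_data, headers: true).each # external enumerator

first = patches.next # the only patch row

begin
  patches.next # nothing left to read
  exhausted = false
rescue StopIteration
  exhausted = true
end

puts "first patch: record #{first['Seq']} -> #{first['Lang']}"
puts "exhausted: #{exhausted}"
```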
