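I have a text file that I need to parse. In this file, each record's content is spread across a variable number of lines; the number of lines per record is not fixed. The file looks like this: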

ID\tcontent\tcontent
\tcontent\tcontent
\tcontent\tcontent
\tcontent\tcontent
ID\tcontent\tcontent
\tcontent\tcontent
\tcontent\tcontent
\tcontent\tcontent
\tcontent\tcontent
\tcontent\tcontent
\tcontent\tcontent
\tcontent\tcontent
ID\tcontent\tcontent
\tcontent\tcontent

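I want to slice it wherever the first tab-separated column contains a value (the ID column is empty on the continuation lines, so this way of detecting a new record should work).

My current code splits it into chunks of five lines and then merges each chunk: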

f = File.read(file).each_line
f.each_slice(5) do |slice_to_handle|
  # Join the five lines into one string before stripping the newlines
  merged_row = slice_to_handle.join.delete("\n").split("\t").collect(&:strip)
  # Dealing with the data here..
end

I need to modify this so that it slices whenever an ID is set in the first column.

2 Answers

Ruby's Array includes Enumerable, which provides slice_before, which is your friend:

text_file = "ID\tcontent\tcontent
\tcontent\tcontent
\tcontent\tcontent
\tcontent\tcontent
ID\tcontent\tcontent
\tcontent\tcontent
\tcontent\tcontent
\tcontent\tcontent
\tcontent\tcontent
\tcontent\tcontent
\tcontent\tcontent
\tcontent\tcontent
ID\tcontent\tcontent
\tcontent\tcontent".split("\n")

text_file.slice_before(/^ID/).map(&:join) 

This returns:

[
  "ID\tcontent\tcontent\tcontent\tcontent\tcontent\tcontent\tcontent\tcontent",
  "ID\tcontent\tcontent\tcontent\tcontent\tcontent\tcontent\tcontent\tcontent\tcontent\tcontent\tcontent\tcontent\tcontent\tcontent\tcontent\tcontent",
  "ID\tcontent\tcontent\tcontent\tcontent"
]

text_file is an array of lines, similar to what you'd get if you slurped a file using readlines.

slice_before iterates over the array looking for matches to the /^ID/ pattern, and creates a new sub-array each time it's found.

map(&:join) walks over the sub-arrays and joins their contents into a single string.
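
If the real IDs are not the literal string ID, slice_before also accepts a block instead of a pattern. Here is a minimal sketch, assuming a new record starts on any line that does not begin with a tab. The IDs A1 and B2 and the field values are made up for illustration:

```ruby
lines = [
  "A1\tfoo\tbar",
  "\tbaz\tqux",
  "B2\tone\ttwo",
  "\tthree\tfour",
  "\tfive\tsix"
]

# Start a new slice at every line whose first column is non-empty,
# i.e. every line that does not begin with a tab.
records = lines.slice_before { |line| !line.start_with?("\t") }.map(&:join)

# Each merged record can then be split into its fields.
records.each { |record| p record.split("\t") }
```

Note that a blank line would also count as the start of a record under this rule, so you may want to filter those out first if your file contains them.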

This is not very scalable though. It relies on slurping the entire file into memory, which can stop a machine in its tracks when the file is large. Instead, it's better to read the content line by line, break it into blocks, and process each block as soon as it is complete.
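
One way to do that is to call slice_before on a line enumerator rather than on an array, so only the lines of the current record are held in memory. This is a sketch under assumptions, not a definitive implementation: records_from is a hypothetical helper name, a StringIO stands in for the real file, and a record is again assumed to start on any line that does not begin with a tab.

```ruby
require "stringio"

# Hypothetical helper: returns a lazy enumerator of records, where each
# record is an array of stripped fields. Lines are pulled from io one at
# a time, and a record is emitted once the next record's first line is seen.
def records_from(io)
  io.each_line
    .slice_before { |line| !line.start_with?("\t") }
    .lazy
    .map { |lines| lines.join.split("\t").map(&:strip) }
end

# StringIO stands in for a file on disk in this example.
sample = StringIO.new("A1\tfoo\tbar\n\tbaz\tqux\nB2\tone\ttwo\n")
records_from(sample).each { |fields| p fields }
```

With a file on disk you could pass an open File object, or start the chain from File.foreach(path), which also yields an enumerator of lines when called without a block.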

answered 2013-07-09T16:15:14.610
File.read(file)
.split(/^(?!\t)/)
.map{|record| record.split("\t").map(&:strip)}

Result

[
  [
    "ID",
    "content",
    "content",
    "content",
    "content",
    "content",
    "content",
    "content",
    "content"
  ],
  [
    "ID",
    "content",
    "content",
    "content",
    "content",
    "content",
    "content",
    "content",
    "content",
    "content",
    "content",
    "content",
    "content",
    "content",
    "content",
    "content",
    "content"
  ],
  [
    "ID",
    "content",
    "content",
    "content",
    "content"
  ]
]
answered 2013-07-09T12:41:39.510