2

I want to parse a large log file (about 500 MB). If this isn't the right tool for the job, please let me know.

I have a log file whose contents are structured as follows. Each section can have additional key-value pairs:

requestID: saldksadk
time: 92389389
action: foobarr
----------------------
requestID: 2393029
time: 92389389
action: helloworld
source: email
----------------------
requestID: skjflkjasf3
time: 92389389
userAgent: mobile browser
----------------------
requestID: gdfgfdsdf
time: 92389389
action: randoms

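Is there a simple way to process the data in each section of the log? A section can span multiple lines, so I can't just split the string on newlines. For example, is there a simple way to do something like this: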

for(section in log){
   // handle section contents
}
4

5 Answers

5

Using icktoofay's idea, and with a custom record separator, I came up with this:

require 'yaml'

File.open("path/to/file") do |f|
  f.each_line("\n----------------------\n") do |line|
    # Collapse the trailing dash line to "---" so the chunk is valid YAML.
    puts YAML.load(line.sub(/^-{3,}$/, "---")).inspect
  end
end

Output:

{"requestID"=>"saldksadk", "time"=>92389389, "action"=>"foobarr"}
{"requestID"=>2393029, "time"=>92389389, "action"=>"helloworld", "source"=>"email"}
{"requestID"=>"skjflkjasf3", "time"=>92389389, "userAgent"=>"mobile browser"}
{"requestID"=>"gdfgfdsdf", "time"=>92389389, "action"=>"randoms"}
answered 2013-06-07T03:29:48.653
4

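This looks like YAML, although it isn't quite YAML. (YAML separates documents with exactly three dashes, no more.) You could try munging your document so that lines consisting only of hyphens are collapsed to three hyphens, which makes it valid YAML. After that, you can feed it to a YAML parser.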
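
A minimal sketch of that idea, assuming the whole text fits in memory (for a 500 MB file you would probably want to stream it in pieces instead). The helper name parse_sections and the sample data are made up for illustration; YAML.load_stream comes from Ruby's standard yaml library and returns one object per YAML document.

```ruby
require 'yaml'

# Hypothetical helper: collapse every dashes-only line to "---" so the
# text becomes a multi-document YAML stream, then parse every document.
def parse_sections(text)
  normalized = text.gsub(/^-{3,}[ \t]*$/, '---')
  YAML.load_stream(normalized)
end

sample = <<~LOG
  requestID: saldksadk
  time: 92389389
  action: foobarr
  ----------------------
  requestID: 2393029
  time: 92389389
  source: email
LOG

parse_sections(sample).each { |section| p section }
```

Note that YAML applies its own typing, so a numeric-looking value such as 2393029 comes back as an Integer rather than a String.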

answered 2013-06-07T02:44:25.007
3

I saved your sample text to a file named "test.txt". Reading it with:

File.foreach('test.txt').slice_before(/^---/).to_a

returns:

[
  ["requestID: saldksadk\n", "time: 92389389\n", "action: foobarr\n"], 
  ["----------------------\n", "requestID: 2393029\n", "time: 92389389\n", "action: helloworld\n", "source: email\n"], 
  ["----------------------\n", "requestID: skjflkjasf3\n", "time: 92389389\n", "userAgent: mobile browser\n"], 
  ["----------------------\n", "requestID: gdfgfdsdf\n", "time: 92389389\n", "action: randoms\n"]
]

Running each sub-array through a filter strips the leading "---" line:

blocks = File.foreach('test.txt').slice_before(/^---/).map { |ary|
  ary.shift if ary.first[/^---/]
  ary.map(&:chomp)
}

After running that, blocks is:

[
  ["requestID: saldksadk", "time: 92389389", "action: foobarr"],
  ["requestID: 2393029", "time: 92389389", "action: helloworld", "source: email"],
  ["requestID: skjflkjasf3", "time: 92389389", "userAgent: mobile browser"],
  ["requestID: gdfgfdsdf", "time: 92389389", "action: randoms"]
]

A little more tweaking, splitting each line on the first colon only and dropping the whitespace after it:

blocks = File.foreach('test.txt').slice_before(/^---/).map { |ary|
  ary.shift if ary.first[/^---/]
  Hash[ary.map{ |s| s.chomp.split(/:\s*/, 2) }]
}

and blocks will be:

[
  {"requestID"=>"saldksadk", "time"=>"92389389", "action"=>"foobarr"},
  {"requestID"=>"2393029", "time"=>"92389389", "action"=>"helloworld", "source"=>"email"},
  {"requestID"=>"skjflkjasf3", "time"=>"92389389", "userAgent"=>"mobile browser"},
  {"requestID"=>"gdfgfdsdf", "time"=>"92389389", "action"=>"randoms"}
]
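
The snippets above collect every section into one array, which may be more memory than you want for a 500 MB file. Since slice_before returns an enumerator, the sections can also be handled one at a time. This is only a sketch: each_section is a hypothetical helper name, and skipping lines without a colon is my assumption about how malformed lines should be treated.

```ruby
require 'tempfile'

# Hypothetical helper: yields one Hash per section while reading the
# file line by line, so only the current section is held in memory.
def each_section(path)
  return enum_for(:each_section, path) unless block_given?

  File.foreach(path).slice_before(/^---/).each do |lines|
    lines.shift if lines.first.start_with?('---') # drop the separator line
    pairs = lines.map { |line| line.chomp.split(/:\s*/, 2) }
                 .select { |pair| pair.length == 2 } # assumption: skip blank or malformed lines
    yield pairs.to_h unless pairs.empty?
  end
end

# Usage, with a small temporary file standing in for the real log.
Tempfile.create('log') do |f|
  f.write("requestID: a1\ntime: 1\n----------------------\nrequestID: b2\nsource: email\n")
  f.flush
  each_section(f.path) { |section| p section }
end
```

This gives the "for each section" loop asked about in the question without materializing the full list.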
answered 2013-06-07T03:24:13.567
3

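You can read the file line by line. For each line, check whether it is a record separator or a key:value pair. If it is a separator, add the current record to the list of records. If it is a key:value pair, add it to the current record.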

records = []
record = {}
File.foreach("data.txt") do |line|  # foreach closes the file when done
  if line.start_with? "-"
    records << record unless record.empty?
    record = {}
  else
    k, v = line.split(":", 2).map(&:strip)
    record[k] = v
  end
end
records << record unless record.empty?

This produces something like:

[{"requestID"=>"saldksadk", "time"=>"92389389", "action"=>"foobarr"},
 {"requestID"=>"2393029", "time"=>"92389389", "action"=>"helloworld", "source"=>"email"},
 {"requestID"=>"skjflkjasf3", "time"=>"92389389", "userAgent"=>"mobile browser"}, 
 {"requestID"=>"gdfgfdsdf", "time"=>"92389389", "action"=>"randoms"}]
answered 2013-06-07T03:34:08.020
1

A very basic approach that keeps things simple and efficient:

blocks = []
current_block = {}

sep_range = 0..3
sep_value = "----"

split_pattern = /:\s*/

File.open("filename.txt", 'r') do |f|
  f.each_line do |line|
    if line[sep_range] == sep_value
      blocks << current_block unless current_block.empty?
      current_block = {}
    else
      key, value = line.chomp.split(split_pattern, 2)
      current_block[key] = value
    end
  end
end

blocks << current_block unless current_block.empty?

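The key point is that we avoid creating needless duplicate objects inside the loop (the range, the test string, and the split regexp) by defining them once before the loop starts, which can save a little time and memory. On a 500 MB file, that may matter.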
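
How much hoisting actually saves depends on the Ruby version. As far as I know, a regexp literal without interpolation and a range of integer literals are each built once, while a plain string literal is allocated on every pass unless it is frozen. So it is worth measuring rather than assuming. The sketch below uses the standard benchmark library; the method names, the synthetic LINES data, and the line count are all made up for illustration.

```ruby
require 'benchmark'

# Synthetic input: every fourth line is a separator.
LINES = Array.new(200_000) do |i|
  i % 4 == 3 ? "----------------------\n" : "key#{i}: value#{i}\n"
end

# Literals written inline inside the loop.
def parse_inline(lines)
  separators = 0
  lines.each do |line|
    if line[0..3] == "----"
      separators += 1
    else
      line.split(/:\s*/, 2)
    end
  end
  separators
end

# Same logic with the range, string, and regexp defined once up front.
def parse_hoisted(lines)
  sep_range = 0..3
  sep_value = "----"
  split_pattern = /:\s*/
  separators = 0
  lines.each do |line|
    if line[sep_range] == sep_value
      separators += 1
    else
      line.split(split_pattern, 2)
    end
  end
  separators
end

Benchmark.bm(8) do |x|
  x.report('inline')  { parse_inline(LINES) }
  x.report('hoisted') { parse_hoisted(LINES) }
end
```

Both methods return the number of separator lines, so their results can be compared to confirm they do the same work before reading the timings.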

answered 2013-06-07T03:39:12.870