regex - Shell scripting / regular expressions: extracting across multiple lines

Question

I'm trying to write a log-parsing script to extract failed events. I can pull these out with grep:

$ grep -A5 "FAILED" log.txt

2008-08-19 17:50:07 [7052] [14] DEBUG:      data: 3a 46 41 49 4c 45 44 20 20 65 72 72 3a 30 32 33   :FAILED  err:023
2008-08-19 17:50:07 [7052] [14] DEBUG:      data: 20 74 65 78 74 3a 20 00                            text: .
2008-08-19 17:50:07 [7052] [14] DEBUG:    Octet string dump ends.
2008-08-19 17:50:07 [7052] [14] DEBUG: SMPP PDU dump ends.
2008-08-19 17:50:07 [7052] [14] DEBUG: SMPP[test] handle_pdu, got DLR
2008-08-19 17:50:07 [7052] [14] DEBUG: DLR[internal]: Looking for DLR smsc=test, ts=1158667543, dst=447872123456, type=2
--
2008-08-19 17:50:07 [7052] [8] DEBUG:      data: 3a 46 41 49 4c 45 44 20 20 65 72 72 3a 30 32 34   :FAILED  err:024
2008-08-19 17:50:07 [7052] [8] DEBUG:      data: 20 74 65 78 74 3a 20 00                            text: .
2008-08-19 17:50:07 [7052] [8] DEBUG:    Octet string dump ends.
2008-08-19 17:50:07 [7052] [8] DEBUG: SMPP PDU dump ends.
2008-08-19 17:50:07 [7052] [8] DEBUG: SMPP[test] handle_pdu, got DLR
2008-08-19 17:50:07 [7052] [8] DEBUG: DLR[internal]: Looking for DLR smsc=test, ts=1040097716, dst=447872987654, type=2

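For each block, I'm interested in the error code (the "023" part of ":FAILED  err:023" on the first line) and the dst number (the "447872123456" in "dst=447872123456" on the last line).

Can anyone help with a shell one-liner to extract these two values, or give me some hints on how I should approach this?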

score 2 · Accepted Answer

grep -A 5 FAILED log.txt |                # Get each FAILED line plus the 5 lines after it
    grep -E '(FAILED|dst=)' |             # Keep just the FAILED/dst lines
    grep -E -o 'err:[0-9]*|dst=[0-9]*' |  # Keep just the err: and dst= phrases
    cut -d ':' -f 2 |                     # Strip "err:" from err: lines
    cut -d '=' -f 2 |                     # Strip "dst=" from dst= lines
    xargs -n 2                            # Combine pairs of numbers

023 447872123456
024 447872987654

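As with all shell "one"-liners, there is almost certainly a more elegant way to do this. However, I find the iterative approach very successful for getting what I want: start with too much information (your grep), narrow it down to the lines I want (with grep), then snip out the part of each line I want (with cut).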

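Although using the Linux toolbox takes more code, you only need to know the basics of a few commands to do just about anything you want. The alternative is to use awk, Python, or another scripting language, which requires more specialized programming knowledge but takes up less screen space.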
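As a rough illustration of the awk alternative mentioned above, here is a minimal sketch that does the whole job in one pass. The function name extract_failed is hypothetical, and the sketch assumes that each ":FAILED  err:NNN" line is followed by its own "dst=" line before the next FAILED line appears (that is, blocks from different threads are not interleaved).

```shell
# Hypothetical helper: print "<err> <dst>" for each FAILED block in a log file.
# Assumes every FAILED line is followed by its dst= line before the next FAILED.
extract_failed() {
    awk '
        # Remember the error code from a line such as ":FAILED  err:023".
        match($0, /:FAILED +err:[0-9]+/) {
            code = substr($0, RSTART, RLENGTH)
            sub(/.*err:/, "", code)
        }
        # On the next dst= line, print the pair and forget the code.
        code != "" && match($0, /dst=[0-9]+/) {
            print code, substr($0, RSTART + 4, RLENGTH - 4)
            code = ""
        }
    ' "$1"
}
```

Call it as "extract_failed log.txt". Because the code is only printed once a dst= line follows a FAILED line, dst= lines that belong to successful deliveries are ignored.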

score 0 · Accepted Answer

A simple solution in Ruby. Here is filter.rb:

#! /usr/bin/env ruby
File.read(ARGV.first).scan(/:FAILED\s+err:(\d+).*?, dst=(\d+),/m).each do |err, dst|
  puts "#{err} #{dst}"
end

Run it:

ruby filter.rb my_log_file.txt

You get:

023 447872123456
024 447872987654

score 0 · Accepted Answer

If the lines always have the same number of fields, you can use awk:

grep -A5 "FAILED" log.txt | awk '$24~/err/ {print $24} $12~/dst/{print $12}'

err:023
dst=447872123456,
err:024
dst=447872987654,

Depending on what the rest of the file looks like, you may be able to skip the grep altogether.

The "$24~/err/ {print $24}" part tells awk to print field number 24 if it contains err. In "~/XXX/", XXX is a regular expression.
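The same field-matching idea can also strip the "err:" and "dst=" prefixes and put each pair on one line. The following is only a sketch: the function name extract_by_field is hypothetical, and it relies on the field positions (24 and 12) seen in the sample log, so it breaks if the line layout changes.

```shell
# Hypothetical helper: field-based variant of the awk approach.
# Assumes err:NNN is always field 24 and dst=NNN, is always field 12.
extract_by_field() {
    grep -A5 'FAILED' "$1" | awk '
        # Error line: drop the "err:" prefix and keep the cursor on the same line.
        $24 ~ /^err:/ { sub(/^err:/, "", $24); printf "%s ", $24 }
        # dst line: drop the "dst=" prefix and the trailing comma, then end the line.
        $12 ~ /^dst=/ { sub(/^dst=/, "", $12); sub(/,$/, "", $12); print $12 }
    '
}
```

Call it as "extract_by_field log.txt". The grep is kept here so that dst= lines outside a FAILED block never reach awk.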
