python - grep, awk, bash and friends? Which tools can process this data?

Question

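I am searching through input to extract specific information about each record. Sadly, each record is spread across multiple lines, e.g. (simplified excerpt)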

01238584 (other info) more info, more info
[age=81][otherinfo][etc, etc]

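The only things I really care about are the identifier and the age (01238584 and 81 in the example). To be clear, the only regular expression I can reliably search for in the input to get close to these two lines is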

\[age=[0-9]+\]

...and of course I want to print that age along with the identifying record number from the line above, e.g.

 01238584   81

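With all my sysadmin shell experience and a good grasp of awk, I haven't been able to come up with a solution. I can of course use grep -B1 to get each pair of lines, but then what? I always use awk for this kind of thing... but the relevant data has always been on the same line. *sigh* This is definitely beyond my current awk skills.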

Thanks for reading. Any pointers?

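Edit
I am going with Charlie's suggestion and changing awk's record separator, which I have never done before. It isn't pretty, but neither is the input. Job done.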

egrep -B1 '\[age=[0-9]+\]' inputfile |
awk '
  BEGIN{ RS = "--" }
  { printf "%s  %s\n", $1, gensub(/.*\[age=([0-9]+)\].*/, "\\1", 1) }'

score 4 · Accepted Answer

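Can you show more of the input file? For example, if the data records are separated by blank lines, you can change the record separator with Awk's RS special variable so that it treats multiple lines as a single record. (See, e.g., http://www.staff.science.uu.nl/~oostr102/docs/nawk/nawk_19.html)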

In any case, I would try to get all of a record's data onto one line, or into one logical record.
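As a rough illustration of the "one logical record" idea, the sketch below groups lines into records at blank lines, much like Awk's paragraph mode (RS = ""), and joins each group into a single line. The blank-line separator is an assumption, since the question does not show how records are delimited, and the name `logical_records` is made up for this example.

```python
import re

AGE_RE = re.compile(r"\[age=([0-9]+)\]")

def logical_records(text):
    """Yield each blank-line-separated block of text as one joined line."""
    for block in re.split(r"\n\s*\n", text.strip()):
        if block.strip():
            yield " ".join(line.strip() for line in block.splitlines())

# Sample data modelled on the question, with an assumed blank line between records.
sample = """\
01238584 (other info) more info, more info
[age=81][otherinfo][etc, etc]

98765432 (still other info) still more info, and more info
[age=82][and more otherinfo][etc, etc, ad infinitum]
"""

for record in logical_records(sample):
    match = AGE_RE.search(record)
    if match:
        # The identifier is assumed to be the first whitespace-separated field.
        print(record.split()[0], match.group(1))
```

Once each record is a single line, the age and the identifier can be pulled out with the same kind of per-line processing one would normally do in awk.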

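If you can't do that, but you know the record ID is always on the line just before the age tag, then it's easy to do in Python using readlines, which reads the whole file into a list of lines, like so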

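 import re

 age_re = re.compile(r"\[age=([0-9]+)\]")
 with open("file.dat") as f:
     lines = f.readlines()
 for ix, line in enumerate(lines):
     m = age_re.search(line)
     if m and ix > 0:  # line has age field
         # record ID is the first field of the previous line
         print(lines[ix - 1].split()[0], m.group(1))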

Or, of course, you can always keep the previous line in memory in Awk

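 BEGIN { prevline = "" }
 match($0, /\[age=[0-9]+\]/) {
     # RSTART/RLENGTH delimit "[age=NN]"; skip "[age=" (5 chars) and drop "]"
     split(prevline, f, " ")
     print f[1], substr($0, RSTART + 5, RLENGTH - 6)
 }
 { prevline = $0 }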
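The same "remember the previous line" idea can be sketched in Python without reading the whole file into a list. This is only an illustration: the function name `id_and_age` is hypothetical, and it assumes the identifier is the first whitespace-separated field of the line directly above the age tag.

```python
import re

AGE_RE = re.compile(r"\[age=([0-9]+)\]")

def id_and_age(lines):
    """Yield (identifier, age) pairs, remembering only the previous line."""
    prev = ""
    for line in lines:
        match = AGE_RE.search(line)
        fields = prev.split()
        # Only report a match when there is a non-empty previous line to take the ID from.
        if match and fields:
            yield fields[0], match.group(1)
        prev = line

sample = [
    "01238584 (other info) more info, more info\n",
    "[age=81][otherinfo][etc, etc]\n",
]
for ident, age in id_and_age(sample):
    print(ident, age)
```

Because it simply iterates over lines, the function should also accept an open file object in place of the sample list.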

score 2 · Accepted Answer

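In this case, Perl can easily be your friend. You can read the whole file into memory in order to apply a regular expression across multiple lines. Setting the input record separator to 0777 (the -0777 switch) causes this "slurp" behavior. The -n switch just says to read the one or more files given on the command line. The argument to the -e switch is the code to execute.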

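The regular expression's /s modifier allows . to match a newline. The /m modifier allows ^ and $ to match immediately after and before embedded newlines. These are the keys to parsing a string that contains multiple logical lines. The /g modifier tells the regular expression engine to search globally for all matches.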

perl -0777 -ne 'print "$1 $2\n" while m{^(\S+).+?\[age=(\d+)\]}gms' file

Given an input file like this:

01238584 (other info) more info, more info
[age=81][otherinfo][etc, etc]
98765432 (still other info) still more info, and more info
[age=82][and more otherinfo][etc, etc, ad infinitum]

...the script above outputs:

01238584 81
98765432 82
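For comparison, here is a rough Python sketch of the same slurp-and-match approach. It is not part of the Perl answer; the function name `extract` is made up, and the sample text simply repeats the input shown above.

```python
import re

# re.MULTILINE plays the role of Perl's /m, re.DOTALL the role of /s,
# and finditer() walks over every match, much like /g in a while loop.
PATTERN = re.compile(r"^(\S+).+?\[age=(\d+)\]", re.MULTILINE | re.DOTALL)

def extract(text):
    """Return (identifier, age) pairs found by matching against the whole text at once."""
    return [(m.group(1), m.group(2)) for m in PATTERN.finditer(text)]

sample = (
    "01238584 (other info) more info, more info\n"
    "[age=81][otherinfo][etc, etc]\n"
    "98765432 (still other info) still more info, and more info\n"
    "[age=82][and more otherinfo][etc, etc, ad infinitum]\n"
)
for ident, age in extract(sample):
    print(ident, age)
```

As with the Perl version, the whole input is held in memory as one string, so this suits files that comfortably fit in RAM.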

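We can dissect the regular expression like this. Note that the m{ }gms delimiters and modifiers are passed inside qr// here, so the explanation treats them as literal text and describes ^ and . without the effect of /m and /s: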

perl -MYAPE::Regex::Explain -e 'print YAPE::Regex::Explain->new(qr/m{^(\S+).+?\[age=(\d+)\]}gms/)->explain()'

The regular expression:

(?-imsx:m{^(\S+).+?\[age=(\d+)\]}gms)

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  m{                       'm{'
----------------------------------------------------------------------
  ^                        the beginning of the string
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    \S+                      non-whitespace (all but \n, \r, \t, \f,
                             and " ") (1 or more times (matching the
                             most amount possible))
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  .+?                      any character except \n (1 or more times
                           (matching the least amount possible))
----------------------------------------------------------------------
  \[                       '['
----------------------------------------------------------------------
  age=                     'age='
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    \d+                      digits (0-9) (1 or more times (matching
                             the most amount possible))
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
  \]                       ']'
----------------------------------------------------------------------
  }gms                     '}gms'
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
