linux - Passing the output of tr as the second argument to awk

Question

My command:

awk 'NR==FNR{a[$0]=1;next;} substr($0,50,6) in a' file1 file2

The problem is that file2 contains \000 characters, so awk treats it as a binary file.

Replacing \000 with a space character:

tr '\000' ' ' < file2 > file2_not_binary

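solves the binary-file problem.

However, file2 is a 20 GB file, and I don't want to run tr as a separate step and save the result as another file. I want to pass the output of tr straight to awk.

I tried: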

awk 'NR==FNR{a[$0]=1;next;} substr($0,50,6) in a' file1 < (tr '\000' ' ' < file2)

But the result is:

The system cannot find the file specified.

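Another question: can my memory, or awk, handle such a large file in one go? I am using a PC with 12 GB of RAM.

Edit

One of the answers does what I wanted (thanks to Ed Morton):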

tr '\000' ' ' < file2 | awk 'NR==FNR{a[$0];next} substr($0,50,6) in a' file1 -

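However, it is 2 times slower than doing the same thing in 2 steps: first replacing \000 and saving the result, then searching with awk. How can I speed it up?

Edit 2

My mistake. Ed Morton's solution is actually a bit faster than running the same thing as two separate commands.

Two separate commands: 08:37:053

Two commands piped: 08:07:204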

score 3 · Accepted Answer

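Since awk does not store your second file in memory (only the lines of file1 are kept, in the array a), the size of that file affects only the execution time, not the memory use. Try this: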

tr '\000' ' ' < file2 | awk 'NR==FNR{a[$0];next} substr($0,50,6) in a' file1 -
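A minimal sketch of the pipeline above on made-up data, to show how the "-" operand makes awk read the piped tr output as its second input. The file contents, the KEY001/KEY999 values and the temporary directory are assumptions for illustration; only the tr and awk commands come from the answer.

```shell
#!/bin/sh
# Hypothetical sample data; the real files are assumed to be fixed-width.
tmp=$(mktemp -d)

# file1: the keys to look for, one per line.
printf 'KEY001\n' > "$tmp/file1"

# file2: 49 filler characters, then a 6-character key in columns 50-55.
# The first record carries a NUL byte as its 49th character.
pad48=$(printf '%48s' '' | tr ' ' 'x')      # 48 "x" characters
{
  printf '%s' "$pad48"; printf '\000'; printf 'KEY001 tail\n'
  printf '%sxKEY999 tail\n' "$pad48"
} > "$tmp/file2"

# tr turns each NUL into a space, so column positions are preserved.
# awk reads file1 first (filling the array a), then the pipe via "-".
result=$(tr '\000' ' ' < "$tmp/file2" |
  awk 'NR==FNR{a[$0];next} substr($0,50,6) in a' "$tmp/file1" -)

# Only the first record should be printed: its key is listed in file1.
printf '%s\n' "$result"
rm -rf "$tmp"
```

Only file1 ends up in the array, which is why the 20 GB input is streamed rather than loaded.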

score 2 · Accepted Answer

It should be:

awk ... <(tr -d '\0' < file2)
# -------^ no space!

See the manual section on Process Substitution.
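A small sketch of the process-substitution form on made-up data. Process substitution is a feature of bash (also ksh and zsh), not of plain POSIX sh, so the sketch calls bash explicitly. It also compares deleting NULs with tr -d against replacing them with a space. Assuming the fixed column offsets from the question must stay intact, deleting a NUL that precedes column 50 shifts the key one position to the left. The sample data and the variable names kept and deleted are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical sample data: one record with a NUL as its 49th character
# and the key KEY001 in columns 50-55.
tmp=$(mktemp -d); export tmp
printf 'KEY001\n' > "$tmp/file1"
pad48=$(printf '%48s' '' | tr ' ' 'x')
{ printf '%s' "$pad48"; printf '\000'; printf 'KEY001 tail\n'; } > "$tmp/file2"

# Replace NUL with a space: columns stay put, so the key still matches.
kept=$(bash <<'EOF'
awk 'NR==FNR{a[$0];next} substr($0,50,6) in a {n++} END{print n+0}' \
    "$tmp/file1" <(tr '\000' ' ' < "$tmp/file2")
EOF
)

# Delete NUL: the record gets one character shorter and the key moves
# to columns 49-54, so substr($0,50,6) no longer finds it.
deleted=$(bash <<'EOF'
awk 'NR==FNR{a[$0];next} substr($0,50,6) in a {n++} END{print n+0}' \
    "$tmp/file1" <(tr -d '\000' < "$tmp/file2")
EOF
)

echo "matches with replacement: $kept, with deletion: $deleted"
rm -rf "$tmp"
```

Note that there is no space between < and ( in either call, as the answer points out.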

score 1 · Accepted Answer

You can do the replacement inside awk with gsub(/\000/," "). To test it, let's make a test file:

$ awk 'BEGIN{print "a b\000c d"}' > foo
$ hexdump -C foo
00000000  61 20 62 00 63 20 64 0a                           |a b.c d.|
00000008

Then:

$ awk '{print; gsub(/\000/," "); print}' foo
a bc d
a b c d
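As a sketch of how the gsub call might be folded into the original lookup so that no tr step is needed: the gsub rule sits after the NR==FNR block, so it runs only on records from the second file. This assumes an awk that keeps NUL bytes in its input records, such as gawk; other awk implementations may cut the record at the first NUL. The sample data and the AWK variable are assumptions for illustration.

```shell
#!/bin/sh
# Assumption: gawk is preferred because it is 8-bit clean; the fallback
# to plain awk may not handle NUL bytes in the input.
AWK=$(command -v gawk || command -v awk)

# Hypothetical sample data: NUL as the 49th character, key in columns 50-55.
tmp=$(mktemp -d)
printf 'KEY001\n' > "$tmp/file1"
pad48=$(printf '%48s' '' | tr ' ' 'x')
{
  printf '%s' "$pad48"; printf '\000'; printf 'KEY001 tail\n'
  printf '%sxKEY999 tail\n' "$pad48"
} > "$tmp/file2"

result=$("$AWK" '
  NR==FNR { a[$0]; next }       # first file: remember the keys
  { gsub(/\000/, " ") }         # second file: NUL -> space, same width
  substr($0,50,6) in a          # print records whose key is known
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Whether this is faster than the tr pipeline on a 20 GB file is not something the thread measures.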
