linux - How to display line numbers when comparing files with linux "comm" tool

Question

I would like to diff two very large files (multi-GB), using linux command line tools, and see the line numbers of the differences. The order of the data matters.

I am running on a Linux machine and the standard diff tool gives me the "memory exhausted" error. -H had no effect.

In my application, I only need to stream the diff results. That is, I just want to look at the first few differences; I don't need to inspect the entire file. If there are differences, a quick glance will tell me what is wrong.

'comm' seems well suited to this, but it does not display line numbers of the differences.

In general, my multi-GB files differ in only a few hundred lines; the rest of the file is the same.

Is there a way to get comm to dump the line number? Or a way to make diff run without loading the entire file into memory? (like cutting the input files into 1k blocks, without actually creating a million 1k-files in my filesystem and cluttering everything up)?

score 1 · Accepted Answer

I won't use comm, but since you said what you need, in addition to how you thought you should do it, I'll focus on the "what you need" instead:

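An interesting way would be to use paste and awk: paste can show 2 files "side by side" using a separator. If you use \n as the separator, it displays the 2 files interleaved: line 1 of each file, then line 2 of each file, and so on.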

So the script you could use can be as simple as this (once you know that both files have the same number of lines):

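 paste -d '\n' /tmp/file1  /tmp/file2 | awk '
        # odd lines come from file1: remember the line
        NR%2  { linefirstfile=$0 ; }
        # even lines come from file2: compare with the remembered line
      !(NR%2) { if ( $0 != linefirstfile )
                      { print "line",NR/2,": "; print linefirstfile ; print $0 ; } }'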

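(Interestingly, this solution extends easily to comparing N files in a single read, whatever the sizes of the N files are... just add a check, before the comparison step, that all files have the same number of lines (otherwise "paste" will end up showing only lines from the bigger files).)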
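The N-file extension mentioned above could be sketched roughly as follows. This is only an illustration, not part of the original answer: the function name compare_n is hypothetical, and it assumes every file has the same number of lines. paste reuses the single '\n' delimiter between all files, so each group of N consecutive lines holds one line from each file.

```shell
# Hypothetical helper (not from the original answer):
#   compare_n FILE1 FILE2 [FILE3 ...]
# Assumes all files have the same number of lines.
compare_n() {
    paste -d '\n' "$@" | awk -v n="$#" '
        {
            pos = (NR - 1) % n              # 0 = line from the first file
            line[pos] = $0
            if (pos > 0 && $0 != line[0]) differ = 1
        }
        pos == n - 1 {                      # last file of the group reached
            if (differ) {
                print "line " (NR / n) ":"
                for (i = 0; i < n; i++) print line[i]
            }
            differ = 0
        }'
}
```

Calling compare_n /tmp/file1 /tmp/file2 /tmp/file3 would print the line number followed by that line from each file, whenever at least one file differs from the first.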

Here is a (short) example to show how the two-file script works:

$ cat > /tmp/file1
A
C %FORGOT% fmsdflmdflskdf dfldksdlfkdlfkdlkf
E

$ cat > /tmp/file2
A
C sdflmsdflmsdfsklmdfksdmfksd fmsdflmdflskdf dfldksdlfkdlfkdlkf
E

$ paste -d '\n' /tmp/file1 /tmp/file2
A
A
C %FORGOT% fmsdflmdflskdf dfldksdlfkdlfkdlkf
C sdflmsdflmsdfsklmdfksdmfksd fmsdflmdflskdf dfldksdlfkdlfkdlkf
E
E

$ paste -d '\n' /tmp/file1 /tmp/file2 | awk '
     NR%2  { linefirstfile=$0 ; }
   !(NR%2) { if ( $0 != linefirstfile ) 
               { print "line",NR/2,": "; print linefirstfile ; print $0 ; } }'
line 2 :
C %FORGOT% fmsdflmdflskdf dfldksdlfkdlfkdlkf
C sdflmsdflmsdfsklmdfksdmfksd fmsdflmdflskdf dfldksdlfkdlfkdlkf

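If the files might not have the same number of lines, you can first add a check on the line counts, comparing $(wc -l < /tmp/file1) and $(wc -l < /tmp/file2) (the redirection keeps the file name out of wc's output), and only run the paste ... | awk if they are equal, to ensure that "paste" works correctly by always pairing one line from each file! (But of course, in that case, there will be one (fast!) full read of each file...)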
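The line-count check described above might look like the sketch below. The function name same_length_diff and the wording of the error message are assumptions made for illustration; the comparison itself is the same paste/awk idea as before, with a slightly different output format.

```shell
# Hypothetical helper (not from the original answer):
#   same_length_diff FILE1 FILE2
same_length_diff() {
    # "wc -l < file" prints only the count; the arithmetic expansion
    # strips the padding that some wc implementations add
    n1=$(( $(wc -l < "$1") ))
    n2=$(( $(wc -l < "$2") ))
    if [ "$n1" -ne "$n2" ]; then
        echo "line counts differ: $1 has $n1, $2 has $n2" >&2
        return 1
    fi
    paste -d '\n' "$1" "$2" | awk '
        NR % 2      { first = $0; next }    # line from the first file
        $0 != first { print "line " (NR / 2) ":"; print first; print $0 }'
}
```

Usage would be same_length_diff /tmp/file1 /tmp/file2; a non-zero return status means the comparison was skipped because the line counts differ.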

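You can easily adapt it to display exactly what you need. You can also quit after the Nth difference (automatically, with a counter in the awk loop, or by pressing CTRL-C when you have seen enough).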
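As one possible way to stop automatically, a counter inside the awk script can call exit once enough differences have been printed. The function name first_diffs and its argument order are hypothetical, and the files are again assumed to have the same number of lines.

```shell
# Hypothetical helper (not from the original answer):
#   first_diffs MAX FILE1 FILE2
# Prints at most MAX differing line pairs, then stops reading.
first_diffs() {
    paste -d '\n' "$2" "$3" | awk -v max="$1" '
        NR % 2 { first = $0; next }         # line from the first file
        $0 != first {
            print "line " (NR / 2) ":"
            print first
            print $0
            if (++count >= max) exit        # enough differences seen
        }'
}
```

For example, first_diffs 10 /tmp/file1 /tmp/file2 would show the first ten differences. Once awk exits, paste stops at its next write to the closed pipe, so the rest of the multi-GB files is not read.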

score 0 · Accepted Answer

Which versions of diff have you tried? GNU diff has a "--speed-large-files" option that might help.

The comm tool assumes that the lines are sorted.
