
I would like to diff two very large (multi-GB) files using Linux command-line tools and see the line numbers of the differences. The order of the data matters.

I am running on a Linux machine, and the standard diff tool gives me a "memory exhausted" error. The -H option had no effect.

In my application, I only need to stream the diff results. That is, I just want to look at the first few differences; I don't need to inspect the entire file. If there are differences, a quick glance will tell me what is wrong.

'comm' seems well suited to this, but it does not display the line numbers of the differences.

In general, my multi-GB files differ in only a few hundred lines; the rest of each file is the same.

Is there a way to get comm to print the line numbers? Or a way to make diff run without loading the entire file into memory (for example, by cutting the input files into 1k blocks, without actually creating a million 1k files in my filesystem and cluttering everything up)?


2 Answers

1

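I won't be using comm, but since you stated what you need in addition to how you thought you should do it, I'll focus on the "what you need" part:

An interesting approach is to use paste and awk. paste can display 2 files "side by side", joined by a separator. If you use \n as the separator, it prints line 1 of each file, then line 2 of each file, and so on.

So the script you need can be quite simple (once you know that both files have the same number of lines):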

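 paste -d '\n' /tmp/file1  /tmp/file2 | awk '
        # odd lines of the merged stream come from file1: remember the line
        # (appending "" makes awk compare as text, so 1.0 and 1 count as different)
        NR%2  { linefirstfile=$0 "" ; }
        # even lines come from file2: compare with the remembered line
      !(NR%2) { if ( $0 != linefirstfile )
                      { print "line",NR/2,": "; print linefirstfile ; print $0 ; } }'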

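(Interestingly, this solution can easily be extended to compare N files in a single read, whatever the size of the N files. Just add a check that all files have the same number of lines before the comparison step; otherwise paste will end up showing only lines from the longer files.)

Here is a (short) example to show how it works: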

$ cat > /tmp/file1
A
C %FORGOT% fmsdflmdflskdf dfldksdlfkdlfkdlkf
E

$ cat > /tmp/file2
A
C sdflmsdflmsdfsklmdfksdmfksd fmsdflmdflskdf dfldksdlfkdlfkdlkf
E

$ paste -d '\n' /tmp/file1 /tmp/file2
A
A
C %FORGOT% fmsdflmdflskdf dfldksdlfkdlfkdlkf
C sdflmsdflmsdfsklmdfksdmfksd fmsdflmdflskdf dfldksdlfkdlfkdlkf
E
E

$ paste -d '\n' /tmp/file1 /tmp/file2 | awk '
     NR%2  { linefirstfile=$0 ; }
   !(NR%2) { if ( $0 != linefirstfile ) 
               { print "line",NR/2,": "; print linefirstfile ; print $0 ; } }'
line 2 :
C %FORGOT% fmsdflmdflskdf dfldksdlfkdlfkdlkf
C sdflmsdflmsdfsklmdfksdmfksd fmsdflmdflskdf dfldksdlfkdlfkdlkf
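
The N-file extension mentioned in the parenthetical above could look roughly like the following. This is only a sketch: `compare_n` is a made-up helper name, it assumes every file has the same number of lines, and it has not been tried on multi-GB inputs.

```shell
# compare_n is a hypothetical helper name; it assumes all files have the same line count.
compare_n() {
    # paste emits line 1 of each file, then line 2 of each file, and so on,
    # so every group of n consecutive lines holds the same line number from each file.
    paste -d '\n' "$@" | awk -v n="$#" '
        { group[(NR - 1) % n] = $0 "" }    # appending "" forces a text comparison below
        NR % n == 0 {
            differ = 0
            for (i = 1; i < n; i++)
                if (group[i] != group[0]) differ = 1
            if (differ) {
                print "line", NR / n, ": "
                for (i = 0; i < n; i++) print group[i]
            }
        }'
}
```

It would be called as `compare_n /tmp/file1 /tmp/file2 /tmp/file3`, and it prints the line number followed by that line from each file whenever any of them differs from the first.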

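If the files might not have the same number of lines, you can first add a check of the line counts, comparing $(wc -l < /tmp/file1) with $(wc -l < /tmp/file2) (the redirection keeps the file name out of wc's output), and only run the paste ... | awk step if the counts match. That ensures paste works as intended, always giving one line from each file. (Of course, in that case there will be one extra, but fast, full read of each file.)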
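
As a rough sketch of that check, the comparison can be wrapped so that it only runs when the counts match. `compare_same_length` is a made-up name for illustration, and the error message format is my own choice.

```shell
# compare_same_length is a hypothetical helper name, not something from a standard tool.
compare_same_length() {
    # Reading from stdin makes wc print only the count, without the file name.
    n1=$(wc -l < "$1")
    n2=$(wc -l < "$2")
    if [ "$n1" -ne "$n2" ]; then
        echo "line counts differ: $n1 vs $n2" >&2
        return 1
    fi
    # Same number of lines: the merged stream strictly alternates file1, file2.
    paste -d '\n' "$1" "$2" | awk '
        NR%2    { linefirstfile = $0 ""; next }   # "" forces a text comparison
        $0 != linefirstfile { print "line", NR/2, ": "; print linefirstfile; print $0 }'
}
```

Calling `compare_same_length /tmp/file1 /tmp/file2` either prints the differing lines or returns a non-zero status with a message on stderr.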

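You can easily adapt it to display exactly what you need. You can also quit after the Nth difference (automatically, using a counter in the awk loop, or by pressing CTRL-C once you have seen enough).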
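
The counter idea might be sketched like this. `first_diffs` and the `max` variable are names I made up for illustration, and the sketch assumes both files have the same number of lines.

```shell
# first_diffs is a hypothetical helper; usage: first_diffs MAX FILE1 FILE2
first_diffs() {
    paste -d '\n' "$2" "$3" | awk -v max="$1" '
        NR%2 { linefirstfile = $0 ""; next }   # remember the file1 line as text
        $0 != linefirstfile {
            print "line", NR/2, ": "; print linefirstfile; print $0
            # stop reading after max differences; paste then ends on a closed pipe
            if (++shown >= max) exit
        }'
}
```

For example, `first_diffs 5 /tmp/file1 /tmp/file2` stops after the fifth differing line, so the rest of the files is never read.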

answered 2013-04-30T17:14:56.850
0

Which versions of diff have you tried? GNU diff has a "--speed-large-files" option that might help.

The comm tool assumes that the lines are sorted.
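
A minimal way to try that option while looking only at the first few differences is sketched below. `first_diff_lines` is a made-up helper name and the limit of 20 lines is arbitrary. I can't confirm that this avoids the "memory exhausted" error on multi-GB inputs, since head limits only the output and not what diff reads.

```shell
# first_diff_lines is a hypothetical helper; --speed-large-files is a GNU diff option.
first_diff_lines() {
    # In diff's default output, a header such as "2c2" gives the line numbers
    # in each file; head keeps only the first 20 lines of that output.
    diff --speed-large-files "$1" "$2" | head -n 20
}
```

It would be called as `first_diff_lines /tmp/file1 /tmp/file2`.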

answered 2013-04-30T18:39:46.637