
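I need to find the lines that are common to multiple files: more than 100 files, each with millions of lines. This is similar to: Shell: Find Matching Lines Across Many Files

However, I want to find not only the lines shared by all files, but also the lines found in all but one file, in all but two files, and so on. I would like to express this as percentages: for example, which entries appear in 90% of the files, in 80%, in 70%, and so on. As an example: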

File 1

lineA
lineB
lineC

File 2

lineB
lineC
lineD

File 3

lineC
lineE
lineF

Hypothetical output, for demonstration purposes:

<lineC> is found in 3 out of 3 files (100.00%)

<lineB> is found in 2 out of 3 files (66.67%)

<lineF> is found in 1 out of 3 files (33.33%)

Does anyone know how to do this?

Many thanks!


1 Answer


Using GNU awk for its multidimensional arrays:

gawk '
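    BEGIN { nfiles = ARGC - 1 }          # number of file arguments
    { lines[$0][FILENAME] = 1 }          # record every file in which this line occurs
    END {
        for (line in lines) {
            n = length(lines[line])      # number of distinct files containing the line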
            printf "<%s> is found in %d of %d files (%.2f%%)\n", line, n, nfiles, 100*n/nfiles
        }
    }
' file{1,2,3}
<lineA> is found in 1 of 3 files (33.33%)
<lineB> is found in 2 of 3 files (66.67%)
<lineC> is found in 3 of 3 files (100.00%)
<lineD> is found in 1 of 3 files (33.33%)
<lineE> is found in 1 of 3 files (33.33%)
<lineF> is found in 1 of 3 files (33.33%)

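The output order is not deterministic, because awk visits the elements of an array in an unspecified order in a for (line in lines) loop.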
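To answer the "which lines appear in at least N% of the files" part directly, the same idea can be combined with a threshold and a sort. The following is only a rough sketch, not the answer's original method: it uses portable awk instead of gawk's nested arrays, resets a per-file "seen" table at the start of each file so that only the per-line counts are kept across files, and pipes the result through sort. The variable name min_pct, the temporary demo files, and the tab-separated output layout are assumptions made for illustration.

```shell
#!/bin/sh
# Use the C locale so that "%.2f" and "sort -n" agree on the decimal point.
export LC_ALL=C

# Hypothetical demo data mirroring the three files from the question.
tmpdir=$(mktemp -d)
printf 'lineA\nlineB\nlineC\n' > "$tmpdir/file1"
printf 'lineB\nlineC\nlineD\n' > "$tmpdir/file2"
printf 'lineC\nlineE\nlineF\n' > "$tmpdir/file3"

# min_pct (assumed name): minimum percentage of files a line must appear in.
min_pct=60
tab=$(printf '\t')

result=$(
    awk -v min_pct="$min_pct" '
        BEGIN { nfiles = ARGC - 1 }          # number of file arguments
        FNR == 1 { split("", seen) }         # new file: clear the per-file table
        !seen[$0]++ { count[$0]++ }          # count each line once per file
        END {
            for (line in count) {
                pct = 100 * count[line] / nfiles
                if (pct >= min_pct)
                    printf "%.2f\t%d\t%s\n", pct, count[line], line
            }
        }
    ' "$tmpdir/file1" "$tmpdir/file2" "$tmpdir/file3" |
    sort -t "$tab" -k1,1nr -k3,3             # highest percentage first, then by line
)

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

With min_pct=60 and the three sample files, this prints lineC (100.00, 3 files) followed by lineB (66.67, 2 files), as tab-separated columns of percentage, file count, and line. Lowering min_pct to 0 lists every line, sorted by how widely it is shared. For inputs the size described in the question, memory use still grows with the number of distinct lines, so this remains a sketch rather than a tuned solution.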

Answered 2018-02-22T17:32:47.670