With awk it is easy to build a single-command solution that works for an arbitrary number of unsorted files. For large files it is also much faster than pipelines built on sort, as shown below. By changing $0 to $1 and so on, you can also find the intersection of a specific column rather than of whole lines; a sketch follows.
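For instance, a minimal sketch keyed on the first column (same caveat as Solution #1 below: it assumes no value repeats within a single file):

awk ' FNR == 1 { b++ }
      { a[$1]++ }              # key on column 1 instead of the whole line ($0)
      END { for (i in a) { if (a[i] == b) { print i } } } ' \
    t1 t2 t3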
I offer three solutions: a simple one that does not handle duplicated lines within files; a more complex one that does handle them; and an even more complex one that also handles them and is (over-)engineered for performance. Solutions #1 and #2 assume a version of awk that has the FNR variable, and solution #3 requires gawk's ENDFILE (although this can be worked around by using FNR == 1 instead and rearranging some of the logic, as sketched after the explanation of #3 below).
Solution #1 (does not handle duplicated lines within files):
awk ' FNR == 1 { b++ } { a[$0]++ } END { for (i in a) { if (a[i] == b) { print i } } } ' \
t1 t2 t3
Solution #2 (handles duplicated lines within files):
awk ' FNR == 1 { b++ ; delete c }
c[$0] == 0 { a[$0]++ ; c[$0] = 1 }
END { for (i in a) { if (a[i] == b) { print i } } } ' \
t1 t2 t3
Solution #3 (high-performance, handles duplicates within files, but complex, and as written depends on gawk's ENDFILE):
awk ' b == 0 { a[$0] = 0 ; next }
$0 in a { a[$0] = 1 }
ENDFILE {
if (b == 0) { b = 1 }
else { for (i in a) { if (a[i] == 0) { delete a[i] } else { a[i] = 0 } } }
}
END { for (i in a) { print i } } ' \
t1 t2 t3
Explanation of #1:
FNR == 1 { b++ } # when awk reads the first line of a new file, FNR resets
# to 1. every time FNR == 1, we increment a counter
# variable b.
# this counts the number of input files.
{ a[$0]++ } # on every line in every file, take the whole line ( $0 ),
# use it as a key in the array a, and increase the value
# of a[$0] by 1.
# this counts the number of observations of line $0 across
# all input files.
END { ... } # after reading the last line of the last file...
for (i in a) { ... } # ... loop over the keys of array a ...
if (a[i] == b) { ... } # ... and if the value at that key is equal to the number
# of input files...
print i # ... we print the key - i.e. the line.
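As a quick sanity check (a made-up toy example, separate from the benchmarks below):

printf 'a\nb\nc\n' > t1
printf 'b\nc\nd\n' > t2
printf 'c\nb\ne\n' > t3
awk ' FNR == 1 { b++ } { a[$0]++ }
      END { for (i in a) { if (a[i] == b) { print i } } } ' t1 t2 t3
# prints b and c (in arbitrary order; `for (i in a)` is unordered)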
Explanation of #2:
c[$0] == 0 { a[$0]++ ; c[$0] = 1 } # as above, but now we include an array c that
# indicates if we've seen lines *within* each file.
# if we haven't seen the line before in this file, we
# increment the count at that line(/key) in array a.
# we also set the value at that key in array c to 1
# to note that we've now seen it in this file before.
FNR == 1 { b++ ; delete c } # as previous solution, but now we also clear the
# array c between files.
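To see why array c is needed, here is a made-up failure case for Solution #1 (hypothetical files d1 and d2): a line that repeats within one file can reach a count of b without appearing in every file.

printf 'x\nx\ny\n' > d1     # x appears twice in d1
printf 'y\nz\n'    > d2     # x never appears in d2
awk ' FNR == 1 { b++ } { a[$0]++ }
      END { for (i in a) { if (a[i] == b) { print i } } } ' d1 d2
# prints x and y, but the true intersection is only y.
# Solution #2's array c caps each line's contribution at 1 per file.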
Explanation of #3:
This post is getting long, so I won't go line-by-line for this one. In short: 1) we create an array a containing every line of the first file as a key, with all values set to 0; 2) on subsequent files, if a line is a key in a, we set the value at that key to 1; 3) at the end of each file, we delete every key in a whose value is 0 (indicating we did not see it in the file we just finished) and reset all remaining values to 0; 4) once all files have been read, we print every key remaining in a. We get a nice speedup here because we don't have to keep and search an array of every line seen so far; we only keep an array of the lines in the intersection of all previous files, which (usually!) shrinks with each new file.
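As mentioned above, the ENDFILE dependency can be avoided by doing the pruning step when FNR == 1 fires on a new file instead. A minimal sketch of that rearrangement (not part of the benchmarks below; f counts files, and the END rule handles both the final file and the single-file edge case):

awk ' FNR == 1 { f++
                 # starting file 3, 4, ...: drop lines not seen in the
                 # file just finished, and reset the survivors to 0
                 if (f > 2) { for (i in a) { if (a[i] == 0) { delete a[i] } else { a[i] = 0 } } }
               }
      f == 1  { a[$0] = 0 ; next }   # first file: collect candidate lines
      $0 in a { a[$0] = 1 }          # later files: mark lines seen again
      END { for (i in a) { if (a[i] == 1 || f == 1) { print i } } } ' \
    t1 t2 t3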
Benchmarks:
Note: the runtime improvements seem to become more significant as the lines in the files get longer.
### Create test data with *no duplicated lines within files*
mkdir test_dir; cd test_dir
for i in {1..30}; do shuf -i 1-540000 -n 500000 > test_no_dups${i}.txt; done
### Solution #0: based on sort and uniq
time sort test_no_dups*.txt | uniq -c | sed -n 's/^ *30 //p' > intersect_no_dups.txt
# real 0m12.982s
# user 0m51.594s
# sys 0m3.250s
wc -l < intersect_no_dups.txt # 53772
### Solution #1:
time \
awk ' FNR == 1 { b++ }
{ a[$0]++ }
END { for (i in a) { if (a[i] == b) { print i } } } ' \
test_no_dups*.txt \
> intersect_no_dups.txt
# real 0m8.048s
# user 0m7.484s
# sys 0m0.313s
wc -l < intersect_no_dups.txt # 53772
### Solution #2:
time \
awk ' FNR == 1 { b++ ; delete c }
c[$0] == 0 { a[$0]++ ; c[$0] = 1 }
END { for (i in a) { if (a[i] == b) { print i } } } ' \
test_no_dups*.txt \
> intersect_no_dups.txt
# real 0m14.965s
# user 0m14.688s
# sys 0m0.297s
wc -l < intersect_no_dups.txt # 53772
### Solution #3:
time \
awk ' b == 0 { a[$0] = 0 ; next }
$0 in a { a[$0] = 1 }
ENDFILE {
if (b == 0) { b = 1 }
else { for (i in a) { if (a[i] == 0) { delete a[i] } else { a[i] = 0 } } }
}
END { for (i in a) { print i } } ' \
test_no_dups*.txt \
> intersect_no_dups.txt
# real 0m5.929s
# user 0m5.672s
# sys 0m0.250s
wc -l < intersect_no_dups.txt # 53772
If the files can contain duplicates:
### Create test data containing repeated lines (-r: sample w/ replacement)
for i in {1..30} ; do
shuf -r -i 1-150000 -n 500000 > test_dups${i}.txt
done
### Solution #0: based on sort and uniq
time \
for i in test_dups*.txt ; do
sort -u "$i"
done \
| sort \
| uniq -c \
| sed -n 's/^ *30 //p' \
> intersect_dups.txt
# real 0m13.503s
# user 0m26.688s
# sys 0m2.297s
wc -l < intersect_dups.txt # 50389
### [Solution #1 won't work here]
### Solution #2:
# note: whole-array `delete c` is a common extension;
# in strictly POSIX awk use `split("", c)` instead
time \
awk ' FNR == 1 { b++ ; delete c }
c[$0] == 0 { a[$0]++ ; c[$0] = 1 }
END { for (i in a) { if (a[i] == b) { print i } } } ' \
test_dups*.txt \
> intersect_dups.txt
# real 0m7.097s
# user 0m6.891s
# sys 0m0.188s
wc -l < intersect_dups.txt # 50389
### Solution #3:
time \
awk ' b == 0 { a[$0] = 0 ; next }
$0 in a { a[$0] = 1 }
ENDFILE {
if (b == 0) { b = 1 }
else { for (i in a) { if (a[i] == 0) { delete a[i] } else { a[i] = 0 } } }
}
END { for (i in a) { print i } } ' \
test_dups*.txt \
> intersect_dups.txt
# real 0m4.616s
# user 0m4.375s
# sys 0m0.234s
wc -l < intersect_dups.txt # 50389