With awk it is easy to build a single-command solution that works for an arbitrary number of unsorted files. For large files it is also much faster than pipelines built on sort, as shown below. By changing $0 to $1 and so on, you can also find the intersection of a specific column rather than of whole lines; a sketch follows.
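For instance, a minimal sketch keyed on the first column (same caveat as Solution #1 below: it assumes no value repeats within a single file):

awk ' FNR == 1 { b++ }
      { a[$1]++ }              # key on column 1 instead of the whole line ($0)
      END { for (i in a) { if (a[i] == b) { print i } } } ' \
    t1 t2 t3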
I offer three solutions: a simple one that does not handle duplicated lines within files; a more complex one that does handle them; and an even more complex one that also handles them and is (over-)engineered for performance. Solutions #1 and #2 assume a version of awk that has the FNR variable, and solution #3 requires gawk's ENDFILE (although this can be worked around by using FNR == 1 instead and rearranging some of the logic, as sketched after the explanation of #3 below).
Solution #1 (does not handle duplicated lines within files):
awk ' FNR == 1 { b++ } { a[$0]++ } END { for (i in a) { if (a[i] == b) { print i } } } ' \
t1 t2 t3
Solution #2 (handles duplicated lines within files):
awk ' FNR == 1 { b++ ; delete c }
c[$0] == 0 { a[$0]++ ; c[$0] = 1 }
END { for (i in a) { if (a[i] == b) { print i } } } ' \
t1 t2 t3
Solution #3 (high-performance, handles duplicates within files, but complex, and as written depends on gawk's ENDFILE):
awk ' b == 0 { a[$0] = 0 ; next }
$0 in a { a[$0] = 1 }
ENDFILE {
if (b == 0) { b = 1 }
else { for (i in a) { if (a[i] == 0) { delete a[i] } else { a[i] = 0 } } }
}
END { for (i in a) { print i } } ' \
t1 t2 t3
Explanation of #1:
FNR == 1 { b++ } # when awk reads the first line of a new file, FNR resets
# to 1. every time FNR == 1, we increment a counter
# variable b.
# this counts the number of input files.
{ a[$0]++ } # on every line in every file, take the whole line ( $0 ),
# use it as a key in the array a, and increase the value
# of a[$0] by 1.
# this counts the number of observations of line $0 across
# all input files.
END { ... } # after reading the last line of the last file...
for (i in a) { ... } # ... loop over the keys of array a ...
if (a[i] == b) { ... } # ... and if the value at that key is equal to the number
# of input files...
print i # ... we print the key - i.e. the line.
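As a quick sanity check (a made-up toy example, separate from the benchmarks below):

printf 'a\nb\nc\n' > t1
printf 'b\nc\nd\n' > t2
printf 'c\nb\ne\n' > t3
awk ' FNR == 1 { b++ } { a[$0]++ }
      END { for (i in a) { if (a[i] == b) { print i } } } ' t1 t2 t3
# prints b and c (in arbitrary order; `for (i in a)` is unordered)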
Explanation of #2:
c[$0] == 0 { a[$0]++ ; c[$0] = 1 } # as above, but now we include an array c that
# indicates if we've seen lines *within* each file.
# if we haven't seen the line before in this file, we
# increment the count at that line(/key) in array a.
# we also set the value at that key in array c to 1
# to note that we've now seen it in this file before.
FNR == 1 { b++ ; delete c } # as previous solution, but now we also clear the
# array c between files.
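To see why array c is needed, here is a made-up failure case for Solution #1 (hypothetical files d1 and d2): a line that repeats within one file can reach a count of b without appearing in every file.

printf 'x\nx\ny\n' > d1     # x appears twice in d1
printf 'y\nz\n'    > d2     # x never appears in d2
awk ' FNR == 1 { b++ } { a[$0]++ }
      END { for (i in a) { if (a[i] == b) { print i } } } ' d1 d2
# prints x and y, but the true intersection is only y.
# Solution #2's array c caps each line's contribution at 1 per file.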
Explanation of #3:
This post is getting long, so I won't go line-by-line for this one. In short: 1) we create an array a containing every line of the first file as a key, with all values set to 0; 2) on subsequent files, if a line is a key in a, we set the value at that key to 1; 3) at the end of each file, we delete every key in a whose value is 0 (indicating we did not see it in the file we just finished) and reset all remaining values to 0; 4) once all files have been read, we print every key remaining in a. We get a nice speedup here because we don't have to keep and search an array of every line seen so far; we only keep an array of the lines in the intersection of all previous files, which (usually!) shrinks with each new file.
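As mentioned above, the ENDFILE dependency can be avoided by doing the pruning step when FNR == 1 fires on a new file instead. A minimal sketch of that rearrangement (not part of the benchmarks below; f counts files, and the END rule handles both the final file and the single-file edge case):

awk ' FNR == 1 { f++
                 # starting file 3, 4, ...: drop lines not seen in the
                 # file just finished, and reset the survivors to 0
                 if (f > 2) { for (i in a) { if (a[i] == 0) { delete a[i] } else { a[i] = 0 } } }
               }
      f == 1  { a[$0] = 0 ; next }   # first file: collect candidate lines
      $0 in a { a[$0] = 1 }          # later files: mark lines seen again
      END { for (i in a) { if (a[i] == 1 || f == 1) { print i } } } ' \
    t1 t2 t3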
Benchmarks:
Note: the runtime improvements seem to become more significant as the lines in the files get longer.
### Create test data with *no duplicated lines within files*
mkdir test_dir; cd test_dir
for i in {1..30}; do shuf -i 1-540000 -n 500000 > test_no_dups${i}.txt; done
### Solution #0: based on sort and uniq
time sort test_no_dups*.txt | uniq -c | sed -n 's/^ *30 //p' > intersect_no_dups.txt
# real 0m12.982s
# user 0m51.594s
# sys 0m3.250s
wc -l < intersect_no_dups.txt # 53772
### Solution #1:
time \
awk ' FNR == 1 { b++ }
{ a[$0]++ }
END { for (i in a) { if (a[i] == b) { print i } } } ' \
test_no_dups*.txt \
> intersect_no_dups.txt
# real 0m8.048s
# user 0m7.484s
# sys 0m0.313s
wc -l < intersect_no_dups.txt # 53772
### Solution #2:
time \
awk ' FNR == 1 { b++ ; delete c }
c[$0] == 0 { a[$0]++ ; c[$0] = 1 }
END { for (i in a) { if (a[i] == b) { print i } } } ' \
test_no_dups*.txt \
> intersect_no_dups.txt
# real 0m14.965s
# user 0m14.688s
# sys 0m0.297s
wc -l < intersect_no_dups.txt # 53772
### Solution #3:
time \
awk ' b == 0 { a[$0] = 0 ; next }
$0 in a { a[$0] = 1 }
ENDFILE {
if (b == 0) { b = 1 }
else { for (i in a) { if (a[i] == 0) { delete a[i] } else { a[i] = 0 } } }
}
END { for (i in a) { print i } } ' \
test_no_dups*.txt \
> intersect_no_dups.txt
# real 0m5.929s
# user 0m5.672s
# sys 0m0.250s
wc -l < intersect_no_dups.txt # 53772
If the files can contain duplicates:
### Create test data containing repeated lines (-r: sample w/ replacement)
for i in {1..30} ; do
shuf -r -i 1-150000 -n 500000 > test_dups${i}.txt
done
### Solution #0: based on sort and uniq
time \
for i in test_dups*.txt ; do
sort -u "$i"
done \
| sort \
| uniq -c \
| sed -n 's/^ *30 //p' \
> intersect_dups.txt
# real 0m13.503s
# user 0m26.688s
# sys 0m2.297s
wc -l < intersect_dups.txt # 50389
### [Solution #1 won't work here]
### Solution #2:
# note: whole-array `delete c` is a common extension;
# in strictly POSIX awk use `split("", c)` instead
time \
awk ' FNR == 1 { b++ ; delete c }
c[$0] == 0 { a[$0]++ ; c[$0] = 1 }
END { for (i in a) { if (a[i] == b) { print i } } } ' \
test_dups*.txt \
> intersect_dups.txt
# real 0m7.097s
# user 0m6.891s
# sys 0m0.188s
wc -l < intersect_dups.txt # 50389
### Solution #3:
time \
awk ' b == 0 { a[$0] = 0 ; next }
$0 in a { a[$0] = 1 }
ENDFILE {
if (b == 0) { b = 1 }
else { for (i in a) { if (a[i] == 0) { delete a[i] } else { a[i] = 0 } } }
}
END { for (i in a) { print i } } ' \
test_dups*.txt \
> intersect_dups.txt
# real 0m4.616s
# user 0m4.375s
# sys 0m0.234s
wc -l < intersect_dups.txt # 50389