linux - how to subtract the two files in linux

Question

I have two files like below:

file1

"Connect"    CONNECT_ID="12"
"Connect"    CONNECT_ID="11"
"Connect"    CONNECT_ID="122"
"Connect"    CONNECT_ID="109"

file2

"Quit"    CONNECT_ID="12"
"Quit"    CONNECT_ID="11"

The real file contents are not exactly the same as above, but similar, and there are at least 100,000 records.

Now I want to get the result shown below into file1 (that is, the final result should end up in file1):

"Connect"    CONNECT_ID="122"
"Connect"    CONNECT_ID="109"

I have used a while loop something like below:

awk '{print $2}' file2 | sed 's/CONNECT_ID=//g' > sample.txt

while read actual; do

    grep -w -v $actual file1 > file1_tmp
    mv -f file1_tmp file1

done < sample.txt

Here I have adjusted my code to match the example, so it may or may not work as written.

My problem is that the loop takes more than 1 hour to complete.

So can anyone suggest how to achieve the same result some other way, such as with diff, comm, sed, awk, or any other Linux command that will run faster?

Mainly, I want to eliminate this big while loop.

score 6 · Accepted Answer

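Most UNIX tools are line based, and since you don't have whole-line matches, that means grep, comm and diff are out the window. For extracting field-based information like you want, awk is perfect: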

$ awk 'NR==FNR{a[$2];next}!($2 in a)' file2 file1
"Connect"    CONNECT_ID="122"
"Connect"    CONNECT_ID="109"

To store the results back into file1, you'll need to redirect the output to a temporary file and then move that file over file1, like so:

$ awk 'NR==FNR{a[$2];next}!($2 in a)' file2 file1 > tmp && mv tmp file1

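Explanation:

The awk variable NR increments for every record read, that is, for each line across all files. The FNR variable also increments for every record, but it is reset at the start of each file. So NR==FNR holds only while awk is reading the first file named on the command line.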

NR==FNR    # This condition is only true while reading the first file argument (file2)
a[$2]      # Add the second field of file2 to the array, which acts as a lookup table
next       # Skip to the next line of file2 (skips any following blocks)
!($2 in a) # We are now reading file1: if the second field is not in the lookup
           # array, execute the default block, i.e. print the line

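To modify this command, you just need to change the fields that are matched. In your real case, if you want to match field 1 of file1 against field 4 of file2, then you would do:

$ awk 'NR==FNR{a[$4];next}!($1 in a)' file2 file1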
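
For readers who find the one-liner dense, the same two-pass lookup can be sketched in Python. This is only an illustration, not part of the original answer: the names field and subtract_by_field are made up here, and fields are assumed to be whitespace-separated and numbered from 1, as in awk.

```python
def field(line, n):
    """Return the n-th whitespace-separated field (1-based), or "" if missing."""
    parts = line.split()
    return parts[n - 1] if len(parts) >= n else ""


def subtract_by_field(keep_lines, drop_lines, keep_field, drop_field):
    """Hypothetical helper: lines of keep_lines whose key is absent from drop_lines."""
    # First pass (the NR==FNR block in awk): build the lookup table.
    seen = {field(line, drop_field) for line in drop_lines}
    # Second pass: keep only the lines whose key is not in the table.
    return [line for line in keep_lines if field(line, keep_field) not in seen]


file1 = [
    '"Connect"    CONNECT_ID="12"',
    '"Connect"    CONNECT_ID="11"',
    '"Connect"    CONNECT_ID="122"',
    '"Connect"    CONNECT_ID="109"',
]
file2 = [
    '"Quit"    CONNECT_ID="12"',
    '"Quit"    CONNECT_ID="11"',
]

remaining = subtract_by_field(file1, file2, keep_field=2, drop_field=2)
for line in remaining:
    print(line)
```

Passing different field numbers for the two inputs corresponds to changing the field references in the awk command.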

score 4 · Answer

This might work for you (GNU sed):

sed -r 's|\S+\s+(\S+)|/\1/d|' file2 | sed -f - -i file1
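
The answer gives no explanation, so here is a rough illustration of what it does. The first sed turns every line of file2 into a delete command of the form /pattern/d, and the second sed reads that generated script from standard input (-f -) and applies it to file1 in place (-i). The Python below only mimics the substitution in order to show the generated script; the variable names are made up, and it assumes \S and \s behave as they do in GNU sed.

```python
import re

file2_lines = [
    '"Quit"    CONNECT_ID="12"',
    '"Quit"    CONNECT_ID="11"',
]

# Same substitution as s|\S+\s+(\S+)|/\1/d| : keep only the second field
# and wrap it as a sed delete command.
script = [re.sub(r'\S+\s+(\S+)', r'/\1/d', line, count=1) for line in file2_lines]

for command in script:
    print(command)
```

Because each generated pattern is an unanchored regular expression, IDs containing slashes or regex metacharacters would need escaping first. With one delete command per line of file2, every line of file1 may also be tested against a very long script, so this may not be the fastest option for 100,000 records.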

answered 2013-08-15T21:08:24.013

score 2 · Answer

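The best tool for this job is join(1). It joins two files based on the values in a given column of each file. Normally it outputs only the lines that match across the two files, but it also has a mode that outputs the lines of one file that have no match in the other.

join requires the files to be sorted on the field you are joining on, so either pre-sort the files, or use process substitution (a bash feature, as in the example below) to do it all on one command line: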

$ join -j 2 -v 1 -o "1.1 1.2" <(sort -k2,2 file1) <(sort -k2,2 file2)
"Connect" CONNECT_ID="109"
"Connect" CONNECT_ID="122"

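-j 2 says to join the files on the second field of both files.

-v 1 says to output only the lines from file 1 that have no match in file 2.

-o "1.1 1.2" says to format the output as the first field of file 1 (1.1) followed by the second field of file 1 (1.2). Without this, join would output the join column first, followed by the remaining columns.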
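
To see why sorted input matters, here is a minimal sketch of the kind of merge that -v 1 implies, written in Python. It is an illustration under assumptions, not a description of how join is actually implemented: key and anti_join_sorted are made-up names, the key is the second whitespace-separated field, and both inputs are sorted on that key with plain string comparison.

```python
def key(line):
    """Second whitespace-separated field, like join -j 2."""
    return line.split()[1]


def anti_join_sorted(left, right):
    """Yield lines of left whose key has no match in right.

    Both lists must already be sorted by key(); each is walked only once.
    """
    j = 0
    for line in left:
        k = key(line)
        # Advance in right past every key smaller than the current one.
        while j < len(right) and key(right[j]) < k:
            j += 1
        if j == len(right) or key(right[j]) != k:
            yield line


file1 = [
    '"Connect"    CONNECT_ID="12"',
    '"Connect"    CONNECT_ID="11"',
    '"Connect"    CONNECT_ID="122"',
    '"Connect"    CONNECT_ID="109"',
]
file2 = [
    '"Quit"    CONNECT_ID="12"',
    '"Quit"    CONNECT_ID="11"',
]

unpaired = list(anti_join_sorted(sorted(file1, key=key), sorted(file2, key=key)))
for line in unpaired:
    print(line)
```

Note that the surviving lines come out in key order (109 before 122) rather than in the original order of file1, because the input had to be sorted first. If the original order matters, a lookup-table approach may be a better fit.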

score 0 · Answer

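The main bottleneck is not really the while loop itself, but the fact that you rewrite the output file thousands of times.

In your particular case, you might be able to get away with this: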
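
To make the cost concrete, the sketch below simulates both approaches on plain lists of IDs and counts how many lines of file1 get read. It is an illustration only: the function names are made up, and it ignores the additional cost of starting a grep process and rewriting the file on every iteration.

```python
def loop_approach(file1_ids, file2_ids):
    """Simulate the shell loop: one full pass over file1 per ID in file2."""
    remaining = list(file1_ids)
    lines_read = 0
    for quit_id in file2_ids:
        lines_read += len(remaining)  # grep reads the whole current file1
        remaining = [i for i in remaining if i != quit_id]  # file1 is rewritten
    return lines_read, remaining


def single_pass(file1_ids, file2_ids):
    """Build a lookup set once, then read file1 a single time."""
    lookup = set(file2_ids)
    remaining = [i for i in file1_ids if i not in lookup]
    return len(file1_ids), remaining  # counts only reads of file1


file1_ids = ["12", "11", "122", "109"]
file2_ids = ["12", "11"]

loop_reads, loop_result = loop_approach(file1_ids, file2_ids)
pass_reads, pass_result = single_pass(file1_ids, file2_ids)
print(loop_reads, pass_reads)  # prints: 7 4
```

On the four-line example the difference is small, but the loop makes one pass over file1 per ID, so if file2 holds around 100,000 IDs that is on the order of 100,000 passes, against a single pass for the lookup approach.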

cut -f2 file2 | grep -Fwvf - file1 >tmp
mv tmp file1

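(I don't think the -w option to grep is useful here, but since you had it in your example, I kept it.)

This presupposes that file2 is tab-delimited; if it is not, the awk '{ print $2 }' file2 you had there is fine.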

score 0 · Answer

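You may want to analyze file2 first and add every ID that appears in it to a cache (e.g. in memory), then scan file1 line by line and check whether each ID is in the cache.

Python code like this: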

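#!/usr/bin/python
# -*- coding: utf-8 -*-

import re

# Capture the ID between the quotes, e.g. 12 from CONNECT_ID="12"
p = re.compile(r'CONNECT_ID="([^"]*)"')

# First pass: collect every ID that appears in file2
quit_ids = set()
with open('file2') as quit_file:
    for line in quit_file:
        m = p.search(line)
        if m:
            quit_ids.add(m.group(1))

# Second pass: copy the lines of file1 whose ID was not seen in file2.
# The result goes to output_file; move it over file1 afterwards if needed.
with open('file1') as connect_file, open('output_file', 'w') as output:
    for line in connect_file:
        m = p.search(line)
        if m and m.group(1) not in quit_ids:
            output.write(line)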
