python - Compare 2 files and remove any lines in file2 when they match values found in file1

Question

I have two files. I am trying to remove any lines in file2 when they match values found in file1. One file has a listing like so:

File1

ZNI008
ZNI009
ZNI010
ZNI011
ZNI012

... over 19463 lines

The second file includes lines that match the items listed in the first:

File2

copy /Y \\server\foldername\version\20050001_ZNI008_162635.xml \\server\foldername\version\folder\
copy /Y \\server\foldername\version\20050001_ZNI010_162635.xml \\server\foldername\version\folder\
copy /Y \\server\foldername\version\20050001_ZNI012_162635.xml \\server\foldername\version\folder\
copy /Y \\server\foldername\version\20050001_ZNI009_162635.xml \\server\foldername\version\folder\

... continues listing until line 51360

What I've tried so far:

grep -v -i -f file1.txt file2.txt > f3.txt

does not produce any output to f3.txt or remove any lines. I verified by running

wc -l file2.txt

and the result is

51360 file2.txt

I believe the reason is that there are no exact matches. When I run the following it shows nothing

comm -1 -2 file1.txt file2.txt

Running

( tr '\0' '\n' < file1.txt; tr '\0' '\n' < file2.txt ) | sort | uniq -c | egrep -v '^ +1'

shows only one match, even though I can clearly see there is more than one match.

Alternatively putting all the data into one file and running the following:

grep -Ev "$(cat file1.txt)" 1>LinesRemoved.log

says argument has too many lines to process.

I need to remove lines matching the items in file1 from file2.

I am also trying this in Python:

#!/usr/bin/python
s = set()

# load each line of file1 into memory as elements of a set, 's'
f1 = open("file1.txt", "r")
for line in f1:
    s.add(line.strip())
f1.close()

# open file2 and split each line on "_" separator,
# second field contains the value ZNIxxx
f2 = open("file2.txt", "r")
for line in f2:
    if line[0:4] == "copy":
        fields = line.split("_")
        # check if the field exists in the set 's'
        if fields[1] not in s:
            match = line
        else:
            match = 0
    else:
        if match:
            print match, line,

It is not working well; I am getting:

Traceback (most recent call last):
  File "./test.py", line 14, in ?
    if fields[1] not in s:
IndexError: list index out of range

score 10 · Accepted Answer


How about:

grep -F -v -f file1 file2 > file3
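Since the question also attempted this in Python, the same fixed-string, inverted match can be sketched there too. This is only a minimal sketch, not a replacement for grep: the function name `remove_matching_lines` and the Python 3 syntax are my own assumptions, the sample lines are shortened versions of the ones in the question, and the code checks every entry against every line, so it will be far slower than grep on files of this size.

```python
def remove_matching_lines(patterns, lines):
    """Return the lines that contain none of the patterns as a substring."""
    # Strip newlines (and any trailing carriage return) from each entry.
    # Empty entries are skipped: an empty string is a substring of every
    # line, so keeping one would filter out the whole file.
    needles = [p.strip() for p in patterns if p.strip()]
    return [line for line in lines
            if not any(needle in line for needle in needles)]


# Shortened, hypothetical sample data modelled on the question.
file1_entries = ["ZNI008\n", "ZNI009\n", "\n"]
file2_lines = [
    "copy /Y 20050001_ZNI008_162635.xml folder\n",
    "copy /Y 20050001_ZNI010_162635.xml folder\n",
    "copy /Y 20050001_ZNI009_162635.xml folder\n",
]

for kept in remove_matching_lines(file1_entries, file2_lines):
    print(kept, end="")
```

Both arguments can be any iterables of strings, so two open file objects work as well as the lists used here.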

answered 2012-04-18T13:22:47.613

score 1 · Accepted Answer

I prefer byrondrossos's grep solution, but here is another option:

sed $(awk '{printf("-e /%s/d ", $1)}' file1) file2 > file3

score 0 · Accepted Answer

This uses Bash and GNU sed, because of the -i switch:

cp file2 file3
while read -r; do
    sed -i "/$REPLY/d" file3
done < file1

There are surely better ways, but here is a trick that avoids -i :D

cp file2 file3
while read -r; do
    (rm file3; sed "/$REPLY/d" > file3) < file3
done < file1

This exploits the shell's order of evaluation.

Well, I guess the proper way to implement this idea is to use ed. This should also be POSIX.

cp file2 file3
while read -r line; do
    ed -s file3 <<EOF
g/$line/d
w
q
EOF
done < file1

In any case, grep really does seem to be the right tool for this job.
@byrondrossos's answer should work well for you ;)

score 0 · Accepted Answer

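Admittedly this is ugly, but it does work. However, all the paths must be identical (apart from the ZNI### part, of course). Everything except the ZNI### is stripped from each path so that grep -vf can run correctly on the sorted files.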

First, convert "testfile2" to "testfileconverted" so that it contains only the ZNI### values:

cat /testfile2 | sed 's:^.*_ZNI:ZNI:g' | sed 's:_.*::g' > /testfileconverted

Second, run an inverse grep of the converted file against "testfile1" and write the reformatted output to "testfile3":

bash -c 'grep -vf <(sort /testfileconverted) <(sort /testfile1)' | sed "s:^:\copy /Y \\\|server\\\foldername\\\version\\\20050001_:g" | sed "s:$:_162635\.xml \\\|server\\\foldername\\\version\\\folder\\\:g" | sed "s:|:\\\:g" > /testfile3
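The key-extraction idea (reducing each line to its ZNI### value before comparing) can also be sketched in Python. Note that this sketch follows the goal stated in the question, dropping lines of file2 whose key appears in file1, rather than reproducing the exact output of the pipeline above. It assumes each key is "ZNI" followed by digits and sits between two underscores, as in 20050001_ZNI008_162635.xml; the names `KEY_RE` and `filter_by_key` are hypothetical, and the syntax is Python 3. Because lines without a key are passed through untouched, it avoids the IndexError that `split("_")` raises in the question.

```python
import re

# Assumption: a key looks like _ZNI008_ inside the file name.
KEY_RE = re.compile(r"_(ZNI\d+)_")


def filter_by_key(keys, lines):
    """Return the lines whose ZNI key is not listed in keys."""
    # A set gives constant-time lookups; blank entries are ignored.
    to_remove = {k.strip() for k in keys if k.strip()}
    kept = []
    for line in lines:
        match = KEY_RE.search(line)
        # Keep lines that carry no key at all (blank lines, other commands)
        # as well as lines whose key is not in the removal set.
        if match is None or match.group(1) not in to_remove:
            kept.append(line)
    return kept


# Shortened, hypothetical sample data modelled on the question.
keys = ["ZNI008\n", "ZNI009\n"]
lines = [
    "copy /Y 20050001_ZNI008_162635.xml folder\n",
    "copy /Y 20050001_ZNI010_162635.xml folder\n",
    "\n",
]

for kept_line in filter_by_key(keys, lines):
    print(kept_line, end="")
```

Unlike the substring approach, this does one set lookup per line, so the running time grows with the size of file2 rather than with the product of both file sizes.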
