python - How do I remove words that appear in both of two files?

Question

I have two text files.

file1.txt contains:

gedit
google chrome
git
vim
foo
bar

file2.txt contains:

firefox
svn
foo
vim

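How can I write a script that, when run with file1.txt and file2.txt as arguments, checks the text on each line for duplicates (I mean it should work line by line) and removes the text that appears in both files?

So after processing, file1.txt and file2.txt should both have the following contents: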

gedit
google chrome
git
bar
firefox
svn

Note that foo and vim have been removed from both files.

Any pointers?

score 3 · Accepted Answer

with open('file1.txt', 'r+') as f1, open('file2.txt', 'r+') as f2:
    file1 = set(x.strip() for x in f1 if x.strip())
    file2 = set(x.strip() for x in f2 if x.strip())
    # symmetric difference drops the values present in both sets and returns a new set
    newfile = file1.symmetric_difference(file2)
    f2.truncate(0)  # truncate the file to 0 bytes
    f1.truncate(0)
    f2.seek(0)  # move the cursor back to the start of the file
    f1.seek(0)
    for x in newfile:
        f1.write(x + '\n')
        f2.write(x + '\n')

Both files now contain the following (sets are unordered, so the line order may differ):

svn
git
firefox
gedit
google chrome
bar
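The question asks for the file names to be passed as arguments, and the set-based code above does not keep the original line order. A minimal sketch that addresses both, assuming the combined result should be written to both files as the question describes; `symmetric_lines`, `read_lines` and `main` are hypothetical names chosen for illustration:

```python
import sys


def symmetric_lines(lines1, lines2):
    """Return the lines found in exactly one input, in first-seen order."""
    common = set(lines1) & set(lines2)  # lines present in both inputs
    seen = set()
    result = []
    for line in list(lines1) + list(lines2):
        # skip blanks, shared lines, and repeats of an already kept line
        if line and line not in common and line not in seen:
            seen.add(line)
            result.append(line)
    return result


def read_lines(path):
    # Strip whitespace so "foo\n" and "foo" compare equal
    with open(path) as f:
        return [line.strip() for line in f]


def main(path1, path2):
    # Read both files completely before overwriting either one
    merged = symmetric_lines(read_lines(path1), read_lines(path2))
    for path in (path1, path2):
        with open(path, "w") as f:
            f.writelines(line + "\n" for line in merged)


if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

Saved under any name, for example dedupe.py, it could be run as: python dedupe.py file1.txt file2.txt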

score 0 · Accepted Answer

Would you save the filtered result as a third file?

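Either way, loop over both files with two nested loops, compare each entry of one with each entry of the other, and drop the entries that are equal. Roughly:

def func(a, b):
    # a and b are lists of lines; collect every entry that occurs in both
    common = []
    for x in a:
        for y in b:
            if x == y and x not in common:
                common.append(x)
    # keep only the entries that never matched
    return ([x for x in a if x not in common],
            [y for y in b if y not in common])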
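The pairwise comparison described above costs one comparison for every pair of lines. A sketch of the same idea using a set for the membership test, which is constant-time on average; `drop_common` is a hypothetical name, and the inputs are assumed to be lists of already stripped lines:

```python
def drop_common(a, b):
    """Return copies of a and b without the entries they share."""
    common = set(a) & set(b)  # entries present in both lists
    # list comprehensions keep the original order of each input
    return ([x for x in a if x not in common],
            [y for y in b if y not in common])
```

Each list keeps only its own unique entries, so the result can be written back per file or concatenated into a third file.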

score 0 · Accepted Answer

If I understand your question correctly, this should be easy.

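# file names taken from the question; strip newlines so lines compare equal
with open('file1.txt') as ifile1, open('file2.txt') as ifile2:
    alist = [line.strip() for line in ifile1]

    for line in ifile2:
        line = line.strip()
        if line in alist:
            alist.remove(line)  # present in both files: drop it
        else:
            alist.append(line)  # only in the second file: keep it

for line in alist:
    print(line)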

score 0 · Accepted Answer

If the files are small enough to fit in memory, this will do the job:

with open("file1.txt", "r") as f1, open("file2.txt", "r") as f2:
    # create a set from the bigger file 
    result = set(x.strip() for x in f1.readlines())
    # remove duplicates or add unique values from 2nd file
    for line in f2:
        line = line.strip()
        if line in result:
            result.remove(line)
        else:
            result.add(line)
result = "\n".join(result)

# for debug, don't replace original files
with open("file1_out.txt", "w") as f1, open("file2_out.txt", "w") as f2:
    f1.write(result)
    f2.write(result)

# if not inside a function, free memory explicitly  
del result
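The remove-or-add loop above toggles membership, so a line that is repeated inside the second file gets added and then removed again. A sketch of the same computation with the set operator `^` (symmetric difference), which collapses such repeats first; `unique_to_one_file` is a hypothetical name:

```python
def unique_to_one_file(lines1, lines2):
    """Lines that occur in exactly one of the two inputs."""
    # ^ on sets is the symmetric difference; duplicates inside a
    # single input collapse before the comparison happens
    return set(lines1) ^ set(lines2)


first = ["gedit", "vim", "foo"]
second = ["svn", "svn", "foo"]
print(sorted(unique_to_one_file(first, second)))  # ['gedit', 'svn', 'vim']
```

With the toggle loop, the repeated svn in this input would cancel itself out and be missing from the result.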

score 0 · Accepted Answer

For Python 2.7+, which introduced Counter:

>>> from collections import Counter
>>> file_1 = ['gedit','google chrome','git','vim','foo','bar']
>>> file_2 = ['firefox','svn','foo','vim']
>>> de_dup = [i for i,c in Counter(file_1+file_2).items() if c == 1]
>>> de_dup
['svn', 'git', 'bar', 'gedit', 'google chrome', 'firefox']
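Counting the concatenated lists also counts repeats within a single file, so a word that appears twice in one file and never in the other would be dropped as well. A sketch that counts each word at most once per file, assuming that is the intended behavior; `lines_in_one_list` is a hypothetical name:

```python
from collections import Counter


def lines_in_one_list(*lists):
    """Return the lines that occur in exactly one of the given lists."""
    counts = Counter()
    for lines in lists:
        counts.update(set(lines))  # count each line once per list
    return [line for line, n in counts.items() if n == 1]
```

The order of the returned list is not guaranteed, because each list passes through a set on the way in.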

score -1 · Accepted Answer

Let's start with the input file names:

files = ('raz.txt','dwa.txt')

And some helper functions. This one is a generator that reads all the words from a file:

def read(filename):
    with open(filename) as f:
        for line in f:
            word = line.strip()
            if word:  # skip blank lines; the raw line still holds '\n'
                yield word

And this one writes a sequence to a file:

def write(filename, lines):
    with open(filename, 'w') as f:
        f.write('\n'.join(lines))

So let's create two generators, one per input file:

words = [read(filename) for filename in files]

Then let's turn the list of generators into a list of sets:

wordSets = list(map(set, words))

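Now we have a list of 2 sets, each containing only the unique words from one file.

Let's build another set holding the words present in all input files by intersecting those sets: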

commonWords = set.intersection(*wordSets)

Now it's time to rewrite the files:

for filename in files:

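Since we want to save to exactly the same file, we unfortunately have to read its whole content into memory first and write from there. (If you were writing the output to a different file, you would not need to buffer the file.)

Let's create a reader generator and read it fully into memory by wrapping it in list():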

    lines = list(read(filename))

Then write the words back to the given file in order, but only those that are not in commonWords:

    write(filename, (word for word in lines if word not in commonWords))

Input:

raz.txt

gedit
google chrome
git
vim
foo
bar

dwa.txt

firefox
svn
foo
vim

Output:

raz.txt

gedit
google chrome
git
bar

dwa.txt

firefox
svn

The duplicates are removed from both files.
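Since the walkthrough above is split across several fragments, here is one way the pieces might fit together in a single script. This is a sketch, not the answerer's exact code: `remove_shared` is a hypothetical wrapper name, and blank lines are assumed to be skipped after stripping.

```python
def read(filename):
    # Generator yielding the non-blank, stripped lines of a file
    with open(filename) as f:
        for line in f:
            word = line.strip()
            if word:
                yield word


def write(filename, lines):
    # Write the sequence back, one entry per line
    with open(filename, "w") as f:
        f.write("\n".join(lines))


def remove_shared(files):
    """Rewrite each file without the words present in all of them."""
    word_sets = [set(read(name)) for name in files]
    common = set.intersection(*word_sets)
    for name in files:
        lines = list(read(name))  # buffer fully before overwriting
        write(name, (w for w in lines if w not in common))
```

Calling remove_shared(('raz.txt', 'dwa.txt')) would rewrite both files in place, with each file keeping its own remaining words in their original order.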
