python - Find the common list between files

Question

I have three text files:

File A:

13  abc
123 def
234 ghi
1234    jkl
12  mno

File B:

12  abc
12  def
34  qwe
43  rty
45  mno

File C:

12  abc
34  sdg
43  yui
54  poi
54  def

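I want to see which values in the second column match across the files. The following code works if the second column is sorted. But if the second column is not sorted, how do I sort the second column and compare the files?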

fileA = open("A.txt",'r')
fileB = open("B.txt",'r')
fileC = open("C.txt",'r')

listA1 = []
for line1 in fileA:
    listA = line1.split('\t')
    listA1.append(listA)


listB1 = []
for line1 in fileB:
    listB = line1.split('\t')
    listB1.append(listB)


listC1 = []
for line1 in fileC:
    listC = line1.split('\t')
    listC1.append(listC)

for key1 in listA1:
    for key2 in listB1:
        for key3 in listC1:
            if key1[1] == key2[1] and key2[1] == key3[1] and key3[1] == key1[1]:
                print "Common between three files:",key1[1]

print "Common between file1 and file2 files:"
for key1 in listA1:
    for key2 in listB1:
        if key1[1] == key2[1]:
            print key1[1]

print "Common between file1 and file3 files:"
for key1 in listA1:
    for key2 in listC1:
        if key1[1] == key2[1]:
            print key1[1]

score 3 · Accepted Answer

If you just want to sort A1, B1, and C1 by the second column, that's easy:

import operator

listA1.sort(key=operator.itemgetter(1))

If you don't understand itemgetter, this is equivalent:

listA1.sort(key=lambda element: element[1])
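
To see that the two key functions behave the same, here is a small sketch using file C's rows held in memory. The name rows_c is made up for the illustration, and sorted is used instead of list.sort only so the original list stays untouched.

```python
import operator

# Hypothetical in-memory stand-in for the parsed lines of file C.
rows_c = [["12", "abc"], ["34", "sdg"], ["43", "yui"], ["54", "poi"], ["54", "def"]]

# Sort by the second column, once with itemgetter and once with a lambda.
by_itemgetter = sorted(rows_c, key=operator.itemgetter(1))
by_lambda = sorted(rows_c, key=lambda row: row[1])

# Collect the second column in its sorted order.
second_column = [row[1] for row in by_itemgetter]
print(second_column)
```

Both calls should produce the same ordering, since itemgetter(1) and the lambda return the same key for every row.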

However, I think a better solution is to just use a set:

setA1 = set(element[1] for element in listA1)
setB1 = set(element[1] for element in listB1)
setC1 = set(element[1] for element in listC1)

Or, more simply, don't build the lists in the first place; do this:

setA1 = set()
for line1 in fileA:
    # Strip the trailing newline so the last column compares cleanly.
    listA = line1.rstrip('\n').split('\t')
    setA1.add(listA[1])

Either way:

print "Common between file1 and file2 files:"
for key in setA1 & setB1:
    print key
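
The same operator also covers the three-file case from the question. A minimal sketch, assuming the sets already hold the second-column values from the sample files (they are typed in by hand here, and the names set_a, set_b, set_c are only for illustration). It uses print as a function so it also runs on Python 3.

```python
# Second-column values from the sample files in the question, typed in by hand.
set_a = {"abc", "def", "ghi", "jkl", "mno"}
set_b = {"abc", "def", "qwe", "rty", "mno"}
set_c = {"abc", "sdg", "yui", "poi", "def"}

# & can be chained to intersect more than two sets.
common_all = set_a & set_b & set_c
common_ab = set_a & set_b
common_ac = set_a & set_c

# Sets are unordered, so sort before printing to get stable output.
print("Common between three files: %s" % sorted(common_all))
print("Common between file1 and file2: %s" % sorted(common_ab))
print("Common between file1 and file3: %s" % sorted(common_ac))
```

Because a set keeps each value once, every common value is reported once, unlike the nested loops, which print a match for every matching pair of rows.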

To simplify it further, you'll probably want to first refactor the repeated code into a function:

def read_file(path):
    with open(path) as f:
        result = set()
        for line in f:
            # Strip the trailing newline so the last column compares cleanly.
            columns = line.rstrip('\n').split('\t')
            result.add(columns[1])
    return result

setA1 = read_file('A.txt')
setB1 = read_file('B.txt')
setC1 = read_file('C.txt')
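
If you want to try this without the real files, the sketch below writes the question's sample data to a temporary directory first. The SAMPLES dict, the temporary directory, and the length check on short lines are all additions for the sake of the demo, not part of the original code.

```python
import os
import shutil
import tempfile

def read_file(path):
    """Return the set of second-column values in a tab-separated file."""
    result = set()
    with open(path) as f:
        for line in f:
            columns = line.rstrip("\n").split("\t")
            if len(columns) > 1:  # skip blank or malformed lines
                result.add(columns[1])
    return result

# Sample data from the question, written out with real tab separators.
SAMPLES = {
    "A.txt": [("13", "abc"), ("123", "def"), ("234", "ghi"), ("1234", "jkl"), ("12", "mno")],
    "B.txt": [("12", "abc"), ("12", "def"), ("34", "qwe"), ("43", "rty"), ("45", "mno")],
    "C.txt": [("12", "abc"), ("34", "sdg"), ("43", "yui"), ("54", "poi"), ("54", "def")],
}

workdir = tempfile.mkdtemp()
sets = {}
try:
    for name, rows in SAMPLES.items():
        path = os.path.join(workdir, name)
        with open(path, "w") as f:
            for number, label in rows:
                f.write("%s\t%s\n" % (number, label))
        sets[name] = read_file(path)
finally:
    shutil.rmtree(workdir)  # clean up the temporary files

common_ab = sorted(sets["A.txt"] & sets["B.txt"])
print(common_ab)
```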

Then you can find further opportunities. For example:

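import csv

def read_file(path):
    with open(path) as f:
        # The files are tab-separated, so override csv's default comma delimiter.
        return set(row[1] for row in csv.reader(f, delimiter='\t'))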

As John Clements points out, you don't even need all three to be sets, only A1, so you could do this:

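import csv

def read_file(path):
    with open(path) as f:
        # The files are tab-separated, so override csv's default comma delimiter.
        for row in csv.reader(f, delimiter='\t'):
            yield row[1]

setA1 = set(read_file('A.txt'))
iterB1 = read_file('B.txt')
iterC1 = read_file('C.txt')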

The only other change you need is that you have to call intersection instead of using the & operator, so:

for key in setA1.intersection(iterB1):
    print key
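
Note that intersection accepts any number of iterables, so the three-file check can be done in one call. A rough sketch, where second_column is a hypothetical helper that takes lines directly instead of a path (csv.reader accepts any iterable of strings), so the example runs without files on disk:

```python
import csv

def second_column(lines):
    """Yield the second tab-separated field of each line (hypothetical helper)."""
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) > 1:  # skip blank or malformed lines
            yield row[1]

# Sample data from the question, as in-memory lines.
lines_a = ["13\tabc", "123\tdef", "234\tghi", "1234\tjkl", "12\tmno"]
lines_b = ["12\tabc", "12\tdef", "34\tqwe", "43\trty", "45\tmno"]
lines_c = ["12\tabc", "34\tsdg", "43\tyui", "54\tpoi", "54\tdef"]

set_a = set(second_column(lines_a))  # only this one needs to be a set
iter_b = second_column(lines_b)      # generators, consumed lazily
iter_c = second_column(lines_c)

# set.intersection takes plain iterables; the & operator requires sets.
common = set_a.intersection(iter_b, iter_c)
print(sorted(common))
```

One thing to keep in mind: a generator can only be consumed once, so if you also want the A/B and A/C pairs separately you have to create fresh generators (or go back to sets).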

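I'm not sure that last change is really an improvement. But in Python 3.3, the only thing you need to do is change return set(…) into yield from (…), so I'd probably do it. (Even if the files are huge and have tons of duplicates, so that there is a performance cost, I'd just stick unique_everseen from the itertools recipes around the read_file call.)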
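
For illustration, here is roughly what that could look like. This is a sketch only: it needs Python 3.3 or later for yield from, unique_everseen below is a simplified copy of the itertools-recipes helper (without its key argument), and read_lines is a hypothetical variant of read_file that takes lines instead of a path so the example is self-contained.

```python
import csv

def unique_everseen(iterable):
    """Yield each element the first time it is seen, preserving order.

    Simplified from the itertools recipes (no key argument).
    """
    seen = set()
    for element in iterable:
        if element not in seen:
            seen.add(element)
            yield element

def read_lines(lines):
    """Yield the second tab-separated field of each line (hypothetical helper)."""
    yield from (row[1] for row in csv.reader(lines, delimiter="\t") if len(row) > 1)

# Made-up input with repeated second-column values.
lines = ["1\tabc", "2\tabc", "3\tdef", "4\tabc"]

deduplicated = list(unique_everseen(read_lines(lines)))
print(deduplicated)
```

The wrapper keeps the stream lazy while dropping repeats, which is the situation where passing a raw generator to intersection would otherwise do redundant work.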
