
I have three text files:

File A:

13  abc
123 def
234 ghi
1234    jkl
12  mno

File B:

12  abc
12  def
34  qwe
43  rty
45  mno

File C:

12  abc
34  sdg
43  yui
54  poi
54  def

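I want to find which values in the second column match between the files. The following code works if the second column is sorted. But if the second column is not sorted, how do I sort on the second column and compare the files?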

fileA = open("A.txt",'r')
fileB = open("B.txt",'r')
fileC = open("C.txt",'r')

listA1 = []
for line1 in fileA:
    listA = line1.split('\t')
    listA1.append(listA)


listB1 = []
for line1 in fileB:
    listB = line1.split('\t')
    listB1.append(listB)


listC1 = []
for line1 in fileC:
    listC = line1.split('\t')
    listC1.append(listC)

for key1 in listA1:
    for key2 in listB1:
        for key3 in listC1:
            if key1[1] == key2[1] and key2[1] == key3[1] and key3[1] == key1[1]:
                print "Common between three files:",key1[1]

print "Common between file1 and file2 files:"
for key1 in listA1:
    for key2 in listB1:
        if key1[1] == key2[1]:
            print key1[1]

print "Common between file1 and file3 files:"
for key1 in listA1:
    for key2 in listC1:
        if key1[1] == key2[1]:
            print key1[1]

1 Answer


If you just want to sort listA1, listB1, and listC1 by the second column, that's easy:

import operator

listA1.sort(key=operator.itemgetter(1))

If you don't understand itemgetter, this does the same thing:

listA1.sort(key=lambda element: element[1])
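
As a quick check that the two key functions behave the same, here is a minimal sketch in Python 3. The rows are the ones from file A in the question, deliberately put out of order; the names `rows`, `by_itemgetter`, and `by_lambda` are only for illustration.

```python
import operator

# Rows from file A in the question, split into [number, label] and shuffled.
rows = [["12", "mno"], ["123", "def"], ["13", "abc"], ["1234", "jkl"], ["234", "ghi"]]

# sorted() returns a new list; list.sort() would sort in place instead.
by_itemgetter = sorted(rows, key=operator.itemgetter(1))
by_lambda = sorted(rows, key=lambda element: element[1])

labels = [row[1] for row in by_itemgetter]
print(labels)
```

Both calls order the rows by the second column only, so the first column does not influence the result.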

However, I think a better solution is to just use a set:

setA1 = set(element[1] for element in listA1)
setB1 = set(element[1] for element in listB1)
setC1 = set(element[1] for element in listC1)

Or, more simply, don't build the lists in the first place; do this:

setA1 = set()
for line1 in fileA:
    # strip the trailing newline so the last column compares and prints cleanly
    listA = line1.rstrip('\n').split('\t')
    setA1.add(listA[1])

Either way:

print "Common between file1 and file2 files:"
for key in setA1 & setB1:
    print key
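
To make the set approach concrete, here is a small Python 3 sketch using the second-column values from the three sample files in the question. The sets are typed in by hand rather than read from disk, and the variable names are only for illustration.

```python
# Second-column values from files A, B and C in the question.
set_a = {"abc", "def", "ghi", "jkl", "mno"}
set_b = {"abc", "def", "qwe", "rty", "mno"}
set_c = {"abc", "sdg", "yui", "poi", "def"}

# & is set intersection; the order of the input lines never matters.
common_ab = set_a & set_b
common_ac = set_a & set_c
common_all = set_a & set_b & set_c

print("Common between file1 and file2:", sorted(common_ab))
print("Common between file1 and file3:", sorted(common_ac))
print("Common between all three files:", sorted(common_all))
```

Because sets are unordered, no sorting is needed before comparing; `sorted()` is used here only to get a stable print order.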

To simplify it further, you probably want to first refactor the repeated code into a function:

def read_file(path):
    with open(path) as f:
        result = set()
        for line in f:
            # strip the trailing newline before splitting on tabs
            columns = line.rstrip('\n').split('\t')
            result.add(columns[1])
    return result

setA1 = read_file('A.txt')
setB1 = read_file('B.txt')
setC1 = read_file('C.txt')

Then you may find further ways to simplify it. For example:

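import csv

def read_file(path):
    with open(path) as f:
        # the files are tab-separated, so tell csv.reader not to split on commas
        return set(row[1] for row in csv.reader(f, delimiter='\t'))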
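
One detail worth spelling out: `csv.reader` splits on commas by default, so for tab-separated files it needs `delimiter='\t'`. The following Python 3 sketch assumes the files really are tab-separated (as the `split('\t')` in the question suggests) and writes a temporary copy of file A just to have something to read; the temporary path and the blank-line guard are additions for the demo.

```python
import csv
import os
import tempfile

def read_file(path):
    """Return the set of second-column values from a tab-separated file."""
    with open(path, newline="") as f:
        # Skip blank or one-column lines instead of raising IndexError.
        return {row[1] for row in csv.reader(f, delimiter="\t") if len(row) > 1}

# Hypothetical demo data: the contents of file A from the question.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "A.txt")
    with open(path, "w", newline="") as f:
        f.write("13\tabc\n123\tdef\n234\tghi\n1234\tjkl\n12\tmno\n")
    set_a = read_file(path)

print(sorted(set_a))
```

The csv module also drops the line ending for you, so the last column does not carry a stray newline.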

As John Clements points out, you don't even need all three of them to be sets, just A1, so you could do this instead:

def read_file(path):
    with open(path) as f:
        for row in csv.reader(f, delimiter='\t'):
            yield row[1]

setA1 = set(read_file('A.txt'))
iterB1 = read_file('B.txt')
iterC1 = read_file('C.txt')

The only other change you need is that you have to call intersection instead of using the & operator, so:

for key in setA1.intersection(iterB1):
    print key
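
The reason for the method call is that the `&` operator requires both operands to be sets, while `intersection` accepts any iterable. A minimal Python 3 sketch of the difference follows; `second_column` is a hypothetical stand-in for the generator version of `read_file`, fed from an in-memory list instead of a file.

```python
def second_column(lines):
    """Yield the second tab-separated column of each line."""
    for line in lines:
        yield line.rstrip("\n").split("\t")[1]

set_a = {"abc", "def", "ghi", "jkl", "mno"}
# Lines of file B from the question, assuming tab separators.
lines_b = ["12\tabc\n", "12\tdef\n", "34\tqwe\n", "43\trty\n", "45\tmno\n"]

# The method form consumes the generator directly.
common = set_a.intersection(second_column(lines_b))

# The operator form refuses anything that is not a set.
try:
    set_a & second_column(lines_b)
    operator_works = True
except TypeError:
    operator_works = False

print(sorted(common), operator_works)
```

Keep in mind that a generator can only be consumed once, so an iterator such as iterB1 cannot be reused for a second comparison without calling the function again.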

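I'm not sure that last change is actually an improvement. But in Python 3.3, where the only thing you need to do is change return set(…) into yield from (…), I would probably do it. (Even if the files were huge and had many duplicates, so that there was a performance cost, I would just wrap the read_file call in unique_everseen from the itertools recipes.)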
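
A rough Python 3 sketch of that last idea follows. Note that `unique_everseen` is a recipe from the itertools documentation, not a function in the `itertools` module itself, so a simplified version (without the recipe's `key` argument) is written out here. `read_column` is a hypothetical variant of `read_file` that takes an open file object so the demo can run on in-memory data, and the sample rows with duplicates are made up.

```python
import csv
import io

def unique_everseen(iterable):
    """Yield each element the first time it appears, preserving order."""
    seen = set()
    for element in iterable:
        if element not in seen:
            seen.add(element)
            yield element

def read_column(f):
    """Lazily yield the second column of a tab-separated file object."""
    yield from (row[1] for row in csv.reader(f, delimiter="\t"))

# Hypothetical input with repeated second-column values.
file_b = io.StringIO("12\tabc\n12\tdef\n34\tabc\n43\tdef\n45\tmno\n")
values = list(unique_everseen(read_column(file_b)))
print(values)
```

This keeps the reading lazy while still filtering out repeated values before they reach the intersection.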

answered 2013-03-29T20:22:55.153