python - How to filter overlapping rows in a big file in Python

Question

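I am trying to filter overlapping rows out of a big file in Python. The overlap threshold is 25%. In other words, the number of elements in the intersection of any two rows should be less than 0.25 times the number of elements in their union. If the ratio is greater than 0.25, one of the two rows is deleted. So if I have a big file with 1,000,000 rows in total, of which the first 5 rows are: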

c6 c24 c32 c54 c67
c6 c24 c32 c51 c68 c78
c6 c32 c54 c67
c6 c32 c55 c63 c85 c94 c75
c6 c32 c53 c67

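Rows 1 and 2 share 3 elements (c6, c24, c32) and their union has 8 elements (c6, c24, c32, c54, c67, c51, c68, c78). The overlap degree is 3/8 = 0.375 > 0.25, so row 2 is deleted. The same goes for rows 3 and 5. The final answer is rows 1 and 4: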

c6 c24 c32 c54 c67
c6 c32 c55 c63 c85 c94 c75
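
The overlap degree used in this example is the size of the intersection divided by the size of the union (the Jaccard index). As a minimal sketch in Python 3, with `overlap_degree` being a hypothetical helper name and the handling of two empty rows being an assumption, the arithmetic can be checked like this:

```python
def overlap_degree(row_a, row_b):
    """Return intersection size / union size for two whitespace-separated rows."""
    a, b = set(row_a.split()), set(row_b.split())
    union = a | b
    if not union:
        # Assumption: two empty rows count as "no overlap" instead of raising.
        return 0.0
    return len(a & b) / len(union)


rows = [
    "c6 c24 c32 c54 c67",
    "c6 c24 c32 c51 c68 c78",
    "c6 c32 c54 c67",
    "c6 c32 c55 c63 c85 c94 c75",
    "c6 c32 c53 c67",
]
print(overlap_degree(rows[0], rows[1]))  # 3 / 8 = 0.375, so row 2 is deleted
print(overlap_degree(rows[0], rows[3]))  # 2 / 10 = 0.2, so row 4 is kept
```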

The pseudocode is as follows:

for i=1:(n-1)    # n is the number of rows of the big file
    for j=(i+1):n
        if overlap degree of the ith row and jth row is more than 0.25
            delete the jth row from the big file
        end
    end
end

How can I solve this in Python? Thanks!

score 1 · Accepted Answer

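The tricky part is that you have to modify the list you are iterating over and still keep track of two indices. One way to do that is to go backwards, since deleting items whose index is equal to or greater than the indices you are tracking does not affect them.

This code is untested, but you get the idea: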

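with open("file.txt") as fileobj:
    # One set of tokens per line; blank lines are skipped so the union is never empty.
    sets = [set(line.split()) for line in fileobj if line.strip()]

# Walk the first index forwards so that earlier rows win, as in the pseudocode.
first_index = 0
while first_index < len(sets) - 1:
    # Walk the second index backwards: a deletion only shifts items that
    # come after it, and those have already been visited.
    for second_index in range(len(sets) - 1, first_index, -1):
        union = sets[first_index] | sets[second_index]
        intersection = sets[first_index] & sets[second_index]
        if len(intersection) / float(len(union)) > 0.25:
            del sets[second_index]
    first_index += 1

with open("output.txt", "w") as fileobj:
    for set_ in sets:
        # order of the set is undefined, so we need to sort each set
        # (assumes every element looks like "c<number>")
        output = " ".join(sorted(set_, key=lambda x: int(x[1:])))
        fileobj.write("{0}\n".format(output))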

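Since it is clear how the elements of each line are sorted, we can do it this way. If the order were custom, we would have to pair each set with the line it was read from, so that we could write back exactly the line that was read instead of regenerating it.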
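
As a rough sketch of that pairing idea (not the code above, and assuming Python 3), each original line can be stored next to its set. This variant swaps in-place deletion for a different technique: it builds a list of kept rows and only appends a row when it does not overlap any earlier kept row, which gives the same result as the question's pseudocode. The name `filter_overlapping`, the `threshold` parameter, and the skipping of blank lines are all assumptions made for illustration.

```python
def filter_overlapping(lines, threshold=0.25):
    """Keep a line unless it overlaps an earlier kept line by more than threshold."""
    kept = []  # (original_text, token_set) pairs, in input order
    for line in lines:
        text = line.rstrip("\n")
        tokens = set(text.split())
        if not tokens:
            continue  # assumption: blank lines are ignored
        for _, other in kept:
            if len(tokens & other) / len(tokens | other) > threshold:
                break  # overlaps a kept line, so drop this one
        else:
            # No break happened: the line overlaps none of the kept lines.
            kept.append((text, tokens))
    return [text for text, _ in kept]


sample = [
    "c6 c24 c32 c54 c67\n",
    "c6 c24 c32 c51 c68 c78\n",
    "c6 c32 c54 c67\n",
    "c6 c32 c55 c63 c85 c94 c75\n",
    "c6 c32 c53 c67\n",
]
for row in filter_overlapping(sample):
    print(row)
```

Because the function only iterates over `lines`, an open file object can be passed in directly, and the kept rows come back with their original element order. The comparison is still quadratic in the worst case, so a file with a million rows may take a long time.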
