python - How to filter overlap of two rows in a big file in python

Question

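I am trying to filter overlapping rows in a big file in Python.

The overlap threshold is 25% between one pair of rows and another pair of rows. In other words, a pair overlaps when a*b/(c+d-a*b) > 0.25, where a is the number of elements in the intersection of the 1st and 3rd rows, b is the number of elements in the intersection of the 2nd and 4th rows, c is the number of elements of the 1st row multiplied by the number of elements of the 2nd row, and d is the number of elements of the 3rd row multiplied by the number of elements of the 4th row. If the overlap degree is more than 0.25, the 3rd and 4th rows are deleted. So if I have a big file with 1,000,000 rows in total, the first 6 rows are as follows: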

c6 c24 c32 c54 c67
k6 k12 k33 k63 k62
c6 c24 c32 c51 c68 c78
k6 k12 k24 k63
c6 c32 c24 c63 c67 c54 c75
k6 k12 k33 k63

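For the first pair of rows (rows 1 and 2) and the second pair (rows 3 and 4), a=3 (c6, c24, c32), b=3 (k6, k12, k63), c=25 and d=24, so a*b/(c+d-a*b) = 9/40 < 0.25 and rows 3 and 4 are not deleted. Next, the overlap degree of the first pair and the third pair (rows 5 and 6) is 5*4/(25+28-5*4) = 0.61 > 0.25, so the third pair is deleted.
The final answer is the first and second pairs: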

c6 c24 c32 c54 c67
k6 k12 k33 k63 k62
c6 c24 c32 c51 c68 c78
k6 k12 k24 k63
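
As a sanity check on the arithmetic above, here is a minimal sketch of the overlap formula applied to the three sample pairs. The function name `overlap_degree` and the variables `pair1`, `pair2`, `pair3` are made up for illustration, and the code assumes Python 3 (so `/` is true division) and non-empty rows.

```python
def overlap_degree(pair_a, pair_b):
    """Return a*b / (c + d - a*b) for two pairs of rows (hypothetical helper)."""
    first_a, second_a = (set(line.split()) for line in pair_a)
    first_b, second_b = (set(line.split()) for line in pair_b)
    # a*b: size of each row-wise intersection, multiplied together
    ab = len(first_a & first_b) * len(second_a & second_b)
    # c and d: product of the row sizes within each pair
    c = len(first_a) * len(second_a)
    d = len(first_b) * len(second_b)
    return ab / (c + d - ab)

pair1 = ("c6 c24 c32 c54 c67", "k6 k12 k33 k63 k62")
pair2 = ("c6 c24 c32 c51 c68 c78", "k6 k12 k24 k63")
pair3 = ("c6 c32 c24 c63 c67 c54 c75", "k6 k12 k33 k63")

print(overlap_degree(pair1, pair2))  # 9/40 = 0.225, below the threshold
print(overlap_degree(pair1, pair3))  # 20/33, about 0.61, above the threshold
```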

The pseudocode is as follows:

for i=1:(n-1)    # n is a half of the number of rows of the big file
    for j=(i+1):n  
        if  overlap degrees of the ith two rows and jth two rows is more than 0.25
          delete the jth two rows from the big file
        end
    end
end

The Python code is as follows, but it is wrong. How can I fix it?

with open("iuputfile.txt") as fileobj: 
    sets = [set(line.split()) for line in fileobj]
    for first_index in range(len(sets) - 4, -2, -2):
        c=len(sets[first_index])*len(sets[first_index+1])
        for second_index in range(len(sets)-2 , first_index, -2):
            d=len(sets[second_index])*len(sets[second_index+1])
            ab = len(sets[first_index] | sets[second_index])*len(sets[first_index+1] | sets[second_index+1])
            if (ab/(c+d-ab))>0.25:
                del sets[second_index]
                del sets[second_index+1]
with open("outputfile.txt", "w") as fileobj:
    for set_ in sets:
        # order of the set is undefined, so we need to sort each set
        output = " ".join(set_)
        fileobj.write("{0}\n".format(output))

A similar question can be found at https://stackoverflow.com/questions/17321275/

How can I modify this code to solve this problem in Python? Thank you!

score 1 · Accepted Answer

Stack Overflow is not here to program for you or to solve general debugging tasks. It's for specific problems that you've tried to solve yourself but can't. You're asking questions you should, as a programmer, be able to figure out yourself. Start your program like this:

python -m pdb my_script.py

Now you can step through your script line by line using the n command. If you want to see what's inside a variable, simply type the name of that variable. By using this method you will find out why things don't work yourself. There are lots of other smart things you can do using pdb (the python debugger) but for this case the n command is sufficient.

Please put in some more effort towards solving your problem yourself before asking another question here.

That being said, here is what was wrong with your modified script:

with open("iuputfile.txt") as fileobj:
    sets = [set(line.split()) for line in fileobj]
    for first_index in range(len(sets) - 4, -2, -2):
        c = len(sets[first_index]) * len(sets[first_index + 1])
        for second_index in range(len(sets) - 2, first_index, -2):
            d = len(sets[second_index]) * len(sets[second_index + 1])
            # Error 1:
            ab = len(sets[first_index] & sets[second_index]) * \
                len(sets[first_index + 1] & sets[second_index + 1])
            # Error 2:
            overlap = (float(ab) / (c + d - ab))
            if overlap > 0.25:
                # Error 3:
                del sets[second_index + 1]
                del sets[second_index]
    with open("outputfile.txt", "w") as fileobj:
        for set_ in sets:
            # You've removed the sorting, I assume it's because the order
            # is unimportant
            output = " ".join(set_)
            fileobj.write("{0}\n".format(output))

The mistakes were:

Error 1: Intersection is &. Union is |.
Error 2: Since all of the variables are integers, the result will be an integer too, unless you're using Python 3. If you are, this is not an error. If you're not, you need to make sure one of the variables is a float to force the result to be a float as well. Hence the float(ab).
Error 3: Remember to always work from the back and forwards. When you delete sets[second_index], what used to be at sets[second_index + 1] takes its place, so deleting sets[second_index + 1] afterwards will delete what used to be at sets[second_index + 2], which is not what you want. So we delete the largest index first.
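
To make Error 3 concrete, here is a small sketch on a throwaway list (the names `rows`, `wrong`, `right` and `also_right` are illustrative only). It shows how deleting the lower index first shifts the list, and that deleting a slice is another way to drop both rows in one step.

```python
rows = ["r0", "r1", "r2", "r3", "r4", "r5"]
second_index = 2  # intent: drop "r2" and "r3"

# Lower index first: "r3" slides down to index 2, so the second del hits "r4".
wrong = list(rows)
del wrong[second_index]
del wrong[second_index + 1]
print(wrong)  # ['r0', 'r1', 'r3', 'r5']

# Larger index first: nothing that still needs deleting has moved.
right = list(rows)
del right[second_index + 1]
del right[second_index]
print(right)  # ['r0', 'r1', 'r4', 'r5']

# Slice deletion removes both rows at once.
also_right = list(rows)
del also_right[second_index:second_index + 2]
print(also_right)  # ['r0', 'r1', 'r4', 'r5']
```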

score 1 · Accepted Answer

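I've been thinking about how to solve this problem in a better way, without all the reversing and indexing and stuff, and I came up with a solution that is longer and more complicated, but easier to read, prettier, and more maintainable and extendable, IMHO.

First we need a special kind of list that can be iterated over "correctly", even if an item in it is deleted. Here is a blog post going into more detail about how lists and iterators work, and reading it will help you understand what's going on here: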

class SmartList(list):
    def __init__(self, *args, **kwargs):
        super(SmartList, self).__init__(*args, **kwargs)
        self.iterators = []

    def __iter__(self):
        return SmartListIter(self)

    def __delitem__(self, index):
        super(SmartList, self).__delitem__(index)
        for iterator in self.iterators:
            iterator.item_deleted(index)

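We extend the built-in list and make it return a custom iterator instead of the default one. Whenever an item in the list is deleted, we call the item_deleted method of every iterator in the self.iterators list. Here's the code for SmartListIter: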

class SmartListIter(object):
    def __init__(self, smartlist, index=0):
        self.smartlist = smartlist
        smartlist.iterators.append(self)
        self.index = index

    def __iter__(self):
        return self

    def next(self):
        try:
            item = self.smartlist[self.index]
        except IndexError:
            self.smartlist.iterators.remove(self)
            raise StopIteration
        index = self.index
        self.index += 1
        return (index, item)

    def item_deleted(self, index):
        if index >= self.index:
            return
        self.index -= 1

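So the iterator adds itself to the list of iterators, and removes itself from the same list when it is done. If an item with an index lower than the current index is deleted, we decrement the current index by one so that we don't skip an item like a normal list iterator would.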
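
The skipping behaviour of a normal list iterator can be seen in a few lines. This is only an illustration on a made-up list (`items` and `seen` are hypothetical names), not part of the solution itself:

```python
items = ["a", "b", "c", "d"]
seen = []

for item in items:
    seen.append(item)
    if item == "a":
        # Deleting index 0 shifts "b" into the slot the iterator has
        # already passed, so the loop continues with "c".
        del items[0]

print(seen)   # ['a', 'c', 'd']  ("b" was never visited)
print(items)  # ['b', 'c', 'd']
```

The built-in iterator keeps only a position counter and is not told about the deletion, which is the gap the item_deleted callback is meant to close.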

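The next method returns a tuple (index, item) instead of just the item, because that makes things easier when it's time to use these classes: we won't have to mess around with enumerate.

So that should take care of having to go backwards, but we still have to use a lot of indexes to juggle between four different lines in every loop. Since the lines go together two by two, let's make a class for that: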

class LinePair(object):
    def __init__(self, pair):
        self.pair = pair
        self.sets = [set(line.split()) for line in pair]
        self.c = len(self.sets[0]) * len(self.sets[1])

    def overlap(self, other):
        ab = float(len(self.sets[0] & other.sets[0]) * \
            len(self.sets[1] & other.sets[1]))
        overlap = ab / (self.c + other.c - ab)
        return overlap

    def __str__(self):
        return "".join(self.pair)

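The pair attribute is a tuple of two lines read directly from the input file, complete with newlines. We use it later to write that pair back to the file. We also convert both lines to sets and calculate the c attribute, which is a property of every pair of lines. Finally, we make a method that computes the overlap between one LinePair and another. Notice that d is gone, since that is just the c attribute of the other pair.

Now for the grand finale: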

from itertools import izip

with open("iuputfile.txt") as fileobj:
    pairs = SmartList([LinePair(pair) for pair in izip(fileobj, fileobj)])

for first_index, first_pair in pairs:
    for second_index, second_pair in SmartListIter(pairs, first_index + 1):
        if first_pair.overlap(second_pair) > 0.25:
            del pairs[second_index]

with open("outputfile.txt", "w") as fileobj:
    for index, pair in pairs:
        fileobj.write(str(pair))

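Notice how easy it is to read the central loop here, and how short it is. If you need to change this algorithm in the future, it is likely much easier to do with this code than with my other code. The izip is used to group the lines of the input file two by two, as explained here.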
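
The code above targets Python 2 (itertools.izip, and an iterator method named next). For Python 3, the following is a rough sketch of a different, simpler technique rather than a port of SmartList: it walks the pairs once and keeps a pair only if it does not overlap any pair already kept. Under the assumption that pairs are processed front to back, as in the loop above, this should keep the same pairs, because a pair is only ever compared against earlier pairs that survived. The names `filter_pairs`, `parse_pair` and `overlap` are made up for this sketch, the built-in zip stands in for izip, io.StringIO stands in for the input file, and rows are assumed to be non-empty.

```python
import io

def parse_pair(pair):
    """Turn a pair of raw lines into two sets of tokens."""
    return [set(line.split()) for line in pair]

def overlap(sets_a, sets_b):
    """Overlap degree a*b / (c + d - a*b) between two parsed pairs."""
    ab = len(sets_a[0] & sets_b[0]) * len(sets_a[1] & sets_b[1])
    c = len(sets_a[0]) * len(sets_a[1])
    d = len(sets_b[0]) * len(sets_b[1])
    return ab / (c + d - ab)

def filter_pairs(lines, threshold=0.25):
    """Keep each pair of lines unless it overlaps an already kept pair."""
    it = iter(lines)
    kept = []  # (raw pair, parsed sets) for every surviving pair
    # zip(it, it) pulls two consecutive lines per tuple from the same
    # iterator; a trailing unpaired line is silently dropped.
    for pair in zip(it, it):
        sets = parse_pair(pair)
        if all(overlap(prev, sets) <= threshold for _, prev in kept):
            kept.append((pair, sets))
    return [pair for pair, _ in kept]

sample = io.StringIO(
    "c6 c24 c32 c54 c67\n"
    "k6 k12 k33 k63 k62\n"
    "c6 c24 c32 c51 c68 c78\n"
    "k6 k12 k24 k63\n"
    "c6 c32 c24 c63 c67 c54 c75\n"
    "k6 k12 k33 k63\n"
)
result = filter_pairs(sample)
# Prints the first four sample lines; the third pair is filtered out.
print("".join(line for pair in result for line in pair), end="")
```

This still compares each pair with every kept pair, so the worst case remains quadratic in the number of pairs. What it avoids is deleting from a list while iterating over it.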
