I am trying to filter overlapping rows out of a big file in Python.

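The overlap threshold between one pair of rows and another pair of rows is set to 25%. In other words, a pair is considered overlapping when a*b/(c+d-a*b) > 0.25, where a is the number of elements shared by row 1 and row 3, b is the number of elements shared by row 2 and row 4, c is the number of elements in row 1 multiplied by the number of elements in row 2, and d is the number of elements in row 3 multiplied by the number of elements in row 4. If the overlap degree is greater than 0.25, rows 3 and 4 are deleted. So, if I have a big file with 1,000,000 rows in total, the first 6 rows are as follows: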

c6 c24 c32 c54 c67
k6 k12 k33 k63 k62
c6 c24 c32 c51 c68 c78
k6 k12 k24 k63
c6 c32 c24 c63 c67 c54 c75
k6 k12 k33 k63

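For the first pair of rows and the second pair of rows, a=3 (c6, c24, c32), b=3 (k6, k12, k63), c=25 and d=24, so a*b/(c+d-a*b) = 9/40 < 0.25 and rows 3 and 4 are not deleted. Next, the overlap degree of the first pair and the third pair is 5*4/(25+28-5*4) = 0.61 > 0.25, so the third pair is deleted, leaving: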

c6 c24 c32 c54 c67
k6 k12 k33 k63 k62
c6 c24 c32 c51 c68 c78
k6 k12 k24 k63
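
The arithmetic in the worked example can be checked with a short script. This is only a minimal sketch: the helper name overlap_degree is made up for illustration, and it assumes every row is a whitespace-separated list of distinct tokens.

```python
def overlap_degree(pair1, pair2):
    """Return a*b / (c + d - a*b) for two pairs of rows given as strings."""
    s1, s2 = [set(row.split()) for row in pair1]
    s3, s4 = [set(row.split()) for row in pair2]
    ab = len(s1 & s3) * len(s2 & s4)   # a * b: shared elements, row by row
    c = len(s1) * len(s2)              # size product of the first pair
    d = len(s3) * len(s4)              # size product of the second pair
    return float(ab) / (c + d - ab)


first = ("c6 c24 c32 c54 c67", "k6 k12 k33 k63 k62")
second = ("c6 c24 c32 c51 c68 c78", "k6 k12 k24 k63")
third = ("c6 c32 c24 c63 c67 c54 c75", "k6 k12 k33 k63")

print(overlap_degree(first, second))  # 9/40, below the 0.25 threshold
print(overlap_degree(first, third))   # 20/33, above the 0.25 threshold
```

Both values agree with the hand calculation: the second pair stays and the third pair goes.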


for i=1:(n-1)    # n is a half of the number of rows of the big file
    for j=(i+1):n  
        if  overlap degrees of the ith two rows and jth two rows is more than 0.25
          delete the jth two rows from the big file
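
One possible reading of this pseudocode, sketched in Python: a pair survives only if it does not overlap by more than 0.25 with any earlier pair that itself survived. Building a new list of kept pairs avoids deleting from a list while looping over it. The names overlap and filter_pairs are hypothetical, and the sketch assumes the whole file fits in memory.

```python
def overlap(p, q):
    """Overlap degree of two pairs; each pair is a tuple of two sets."""
    ab = len(p[0] & q[0]) * len(p[1] & q[1])
    c = len(p[0]) * len(p[1])
    d = len(q[0]) * len(q[1])
    return float(ab) / (c + d - ab)


def filter_pairs(pairs, threshold=0.25):
    """Keep a pair only if it overlaps no previously kept pair too much."""
    kept = []
    for candidate in pairs:
        if all(overlap(k, candidate) <= threshold for k in kept):
            kept.append(candidate)
    return kept


lines = [
    "c6 c24 c32 c54 c67",
    "k6 k12 k33 k63 k62",
    "c6 c24 c32 c51 c68 c78",
    "k6 k12 k24 k63",
    "c6 c32 c24 c63 c67 c54 c75",
    "k6 k12 k33 k63",
]
# Group consecutive lines two at a time and turn each line into a set.
pairs = [(set(lines[i].split()), set(lines[i + 1].split()))
         for i in range(0, len(lines), 2)]
kept = filter_pairs(pairs)
print(len(kept))  # number of pairs that survive the filter
```

In the worst case this still compares every pair with every kept pair, so it is quadratic in the number of pairs, just like the nested loops above.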


with open("iuputfile.txt") as fileobj: 
    sets = [set(line.split()) for line in fileobj]
    for first_index in range(len(sets) - 4, -2, -2):
        for second_index in range(len(sets)-2 , first_index, -2):
            ab = len(sets[first_index] | sets[second_index])*len(sets[first_index+1] | sets[second_index+1])
            if (ab/(c+d-ab))>0.25:
                del sets[second_index]
                del sets[second_index+1]
with open("outputfile.txt", "w") as fileobj:
    for set_ in sets:
        # order of the set is undefined, so we need to sort each set
        output = " ".join(set_)


How can I modify this code to solve the problem in Python? Thanks!


2 Answers


Stack Overflow is not here to program for you or to solve general debugging tasks. It's for specific problems that you've tried to solve yourself but can't. You're asking questions you should, as a programmer, be able to figure out yourself. Start your program like this:

python -m pdb my_script.py

Now you can step through your script line by line using the n command. If you want to see what's inside a variable, simply type the name of that variable. By using this method you will find out why things don't work yourself. There are lots of other smart things you can do using pdb (the python debugger) but for this case the n command is sufficient.

Please put in some more effort towards solving your problem yourself before asking another question here.

That being said, here is what was wrong with your modified script:

with open("iuputfile.txt") as fileobj:
    sets = [set(line.split()) for line in fileobj]
    for first_index in range(len(sets) - 4, -2, -2):
        c = len(sets[first_index]) * len(sets[first_index + 1])
        for second_index in range(len(sets) - 2, first_index, -2):
            d = len(sets[second_index]) * len(sets[second_index + 1])
            # Error 1:
            ab = len(sets[first_index] & sets[second_index]) * \
                len(sets[first_index + 1] & sets[second_index + 1])
            # Error 2:
            overlap = (float(ab) / (c + d - ab))
            if overlap > 0.25:
                # Error 3:
                del sets[second_index + 1]
                del sets[second_index]
with open("outputfile.txt", "w") as fileobj:
    for set_ in sets:
        # You've removed the sorting, I assume it's because the order
        # is unimportant
        output = " ".join(set_)
        # Write the joined row; without this nothing reaches the file
        fileobj.write(output + "\n")

The mistakes were:

  • Error 1: Intersection is &. Union is |.
  • Error 2: Since all of the variables are integers, the result will be an integer too, unless you're using Python 3. If you are, this is not an error. If you're not, you need to make sure one of the variables is a float to force the result to be a float as well. Hence the float(ab).
  • Error 3: Remember to always work from the back toward the front. When you delete sets[second_index], what used to be at sets[second_index + 1] takes its place, so deleting sets[second_index + 1] afterwards will delete what used to be at sets[second_index + 2], which is not what you want. So we delete the largest index first.
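
The three points above can be reproduced in isolation. The snippet below is a small illustration with made-up variable names, not part of the corrected script:

```python
first = set("c6 c24 c32 c54 c67".split())
third = set("c6 c24 c32 c51 c68 c78".split())

# Error 1: & is intersection, | is union.
common = first & third     # only the elements present in both rows
combined = first | third   # every distinct element from either row

# Error 2: force float division so Python 2 and Python 3 agree.
ratio = float(9) / 40      # on Python 2, plain 9 / 40 would truncate to 0

# Error 3: when deleting two neighbours, delete the larger index first.
rows = ["r0", "r1", "r2", "r3", "r4", "r5"]

wrong = list(rows)
del wrong[2]
del wrong[3]               # everything shifted left, so this removes "r4"

right = list(rows)
del right[3]
del right[2]               # removes "r3" and then "r2", as intended

print(len(common), len(combined), ratio)
print(wrong)
print(right)
```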
answered 2013-06-28T08:20:11.437



class SmartList(list):
    def __init__(self, *args, **kwargs):
        super(SmartList, self).__init__(*args, **kwargs)
        self.iterators = []

    def __iter__(self):
        return SmartListIter(self)

    def __delitem__(self, index):
        super(SmartList, self).__delitem__(index)
        # Tell every live iterator which position was removed
        for iterator in self.iterators:
            iterator.item_deleted(index)


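class SmartListIter(object):
    def __init__(self, smartlist, index=0):
        self.smartlist = smartlist
        # Register with the list so it can report deletions to this iterator
        smartlist.iterators.append(self)
        self.index = index

    def __iter__(self):
        return self

    def __next__(self):
        try:
            item = self.smartlist[self.index]
        except IndexError:
            # Exhausted: stop receiving deletion notifications
            if self in self.smartlist.iterators:
                self.smartlist.iterators.remove(self)
            raise StopIteration
        index = self.index
        self.index += 1
        return (index, item)

    next = __next__  # Python 2 spelling of the iterator protocol

    def item_deleted(self, index):
        # A deletion at or after the current position needs no adjustment
        if index >= self.index:
            return
        # An earlier item vanished, so step back to avoid skipping one
        self.index -= 1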


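The next method returns a tuple (index, item) instead of just the item, because that makes things easier when it comes time to use these classes: we don't have to mess around with enumerate.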
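
To see why an iterator that knows about deletions is useful, consider what a plain list does when items are deleted during a loop. This is a small sketch with made-up data, not taken from the answer:

```python
items = ["a", "b", "b", "c"]
seen = []

for index, item in enumerate(items):
    seen.append(item)
    if item == "b":
        # Everything after this position shifts left by one, but the loop
        # still advances, so the element that slid into `index` is skipped.
        del items[index]

print(seen)   # the second "b" never shows up here
print(items)  # and it is still in the list
```

Stepping the iterator's position back by one after each earlier deletion, as item_deleted does, is one way to avoid that skip.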


class LinePair(object):
    def __init__(self, pair):
        self.pair = pair
        self.sets = [set(line.split()) for line in pair]
        self.c = len(self.sets[0]) * len(self.sets[1])

    def overlap(self, other):
        ab = float(len(self.sets[0] & other.sets[0]) * \
            len(self.sets[1] & other.sets[1]))
        overlap = ab / (self.c + other.c - ab)
        return overlap

    def __str__(self):
        return "".join(self.pair)

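The pair attribute is a tuple of two lines read directly from the input file, newlines included. We use it later to write the pair back to a file. We also convert both lines to sets and compute the c attribute, which is a property of every pair of lines. Finally, we add a method that computes the overlap between one LinePair and another. Note that d is gone, since it is just the c attribute of the other pair.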


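try:
    from itertools import izip  # Python 2
except ImportError:
    izip = zip  # Python 3: the built-in zip is already lazy

with open("iuputfile.txt") as fileobj:
    # izip(fileobj, fileobj) pulls two consecutive lines into each tuple
    pairs = SmartList([LinePair(pair) for pair in izip(fileobj, fileobj)])

for first_index, first_pair in pairs:
    for second_index, second_pair in SmartListIter(pairs, first_index + 1):
        if first_pair.overlap(second_pair) > 0.25:
            del pairs[second_index]

with open("outputfile.txt", "w") as fileobj:
    for index, pair in pairs:
        # str(pair) is the two original lines, newlines included
        fileobj.write(str(pair))
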

answered 2013-06-29T20:19:29.930