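I wrote the following code to split a text file into blocks of 4 lines and to output a block only if its 2nd line is not made up of a single repeated character (homogeneous blocks are filtered out). Assume (this has been verified beforehand) that the second line is always a string of 36 characters.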

# filter out homogeneous reads

import sys
import collections
from collections import Counter

filename1 = sys.argv[1] # file to process

with open(filename1,'r') as input_file:
    for line1 in input_file:
        line2, line3, line4 = [next(input_file) for line in xrange(3)]
        c = Counter(line2).values() # count characters in line2
        c.sort(reverse=True) # sort values in descending order
        if c[0] < 36:
            print line1 + line2 + line3 + line4.rstrip()

However, I get the StopIteration error shown below. I would be grateful if someone could tell me why.

$ python code.py test.file > testout.file
Traceback (most recent call last):
  File "code.py", line 11, in <module>
    line2, line3, line4 = [next(input_file) for line in xrange(3)]
StopIteration

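Any help would be appreciated, especially answers that explain what is wrong with my specific code and how to fix it. Here is an example of the input: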

@1:1:1323:1032:Y
AGCAGCATTGTACAGGGCTATCATGGAATTCTCGGG
+1:1:1323:1032:Y
HHHBHHBHBHGBGGGH8HHHGGGGFHBHHHHBHHHH
@1:1:1610:1033:Y
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+1:1:1610:1033:Y
HHEHHHHHHHHHHHBGGD>GGD@G8GGGGDHBHH4C
@1:1:1679:1032:Y
CGGTGGATCACTCGGCTCGTGCGTCGATGAAGAACG

3 Answers

Your example input already shows the problem: it has 10 lines, which is not divisible by 4. When the very last block is read, you get line1 and line2, but by the time next() is called for line3 the input is exhausted, so next() raises StopIteration.

It’s likely that you have the same issue in your full input file as well: The number of lines is simply not divisible by 4.
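The failure can be reproduced in a few lines. This is only an illustrative sketch: it assumes Python 3 (so range instead of xrange) and uses an in-memory io.StringIO with ten placeholder lines in place of the real file.

```python
import io

# Ten placeholder lines, like the sample input: two complete 4-line blocks
# plus two extra lines.
sample = io.StringIO("".join("line %d\n" % i for i in range(1, 11)))

complete_blocks = 0
error_raised = False
try:
    for line1 in sample:
        # Three further lines are pulled from the same iterator.
        line2, line3, line4 = [next(sample) for _ in range(3)]
        complete_blocks += 1
except StopIteration:
    # Third pass: lines 9 and 10 are consumed, then nothing is left for line3.
    error_raised = True

print(complete_blocks, error_raised)
```

Two blocks are read in full, and the third pass runs out of input partway through.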

There are a few ways to overcome this. The best is probably to fix your input: you seem to expect four lines per block all the way through, so if the input file does not provide that, there is a problem with its content.

Another very simple fix would be to specify a default value with next():

line2, line3, line4 = [next(input_file, '') for line in xrange(3)]

Now, when next() would fail, the default value '' is instead returned. So even if the file is exhausted, you still get some content back.
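As a small sketch of what the default does (assuming Python 3; the two-line list below is a made-up stand-in for a truncated final block), the missing lines simply come back as empty strings, which can also be used to detect an incomplete block:

```python
# Hypothetical truncated block: only two of the four expected lines exist.
lines = iter(["@read\n", "ACGT\n"])

line1 = next(lines)
# The second and third calls find the iterator exhausted and return ''.
line2, line3, line4 = [next(lines, '') for _ in range(3)]

# A real line always contains at least a newline, so '' marks a missing line.
is_complete = bool(line4)
print(repr(line2), repr(line3), repr(line4), is_complete)
```

Note that with this fix the incomplete block is still passed on to the rest of the loop body, so a check like is_complete may be needed if partial blocks should not be printed.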

A probably better solution, however, would be to fix the way you iterate over the file. You currently advance the same file iterator in two locations: once in the outer for loop and three times in the list comprehension. That may look simple enough not to cause other problems, but you should really change it so that the iterator is advanced in a single location only: either let one loop do all the reading, or use next() calls exclusively. Mixing next() with a for loop over the same iterator is a bad idea.

You could for example use the grouper itertools recipe to cleanly iterate the file in groups of four:

with open(filename1, 'r') as input_file:
    for line1, line2, line3, line4 in grouper(input_file, 4, fillvalue=''):
        # do things with the lines
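grouper is not a built-in; it has to be copied from the recipes section of the itertools documentation. A minimal sketch of it, assuming Python 3 (on Python 2 the import would be izip_longest instead of zip_longest) and using ten placeholder lines in an io.StringIO rather than the real file:

```python
import io
from itertools import zip_longest


def grouper(iterable, n, fillvalue=None):
    """Collect items into tuples of length n, padding the last one with fillvalue."""
    # The same iterator object is repeated n times, so zip_longest pulls
    # n consecutive items for each tuple.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


sample = io.StringIO("".join("line %d\n" % i for i in range(1, 11)))
blocks = list(grouper(sample, 4, fillvalue=''))

for line1, line2, line3, line4 in blocks:
    print(repr(line1), repr(line2), repr(line3), repr(line4))
```

With ten input lines this gives three tuples, the last one padded with two empty strings, so the loop body can test for '' to skip an incomplete block.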
answered 2015-12-23T11:13:47.727

You get this error when the number of lines in your file is not divisible by 4: the code then tries to read a line that does not exist. Keep in mind that empty lines count as lines too.

One solution would be to stop processing the file if the number of lines is not enough for processing:

try:
    line2, line3, line4 = [next(input_file) for line in xrange(3)]
except StopIteration:
    break

This feels a bit cleaner:

while True:
    try:
        line1, line2, line3, line4 = [next(input_file) for line in xrange(4)]
    except StopIteration:
        break
    # process line1 .. line4 here

because the iterator is advanced in only one place instead of two.
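The same idea of advancing the iterator in a single place can also be written without catching StopIteration. The following is a variation using itertools.islice, not the code from this answer; read_blocks is a hypothetical helper name, Python 3 is assumed, and the ten placeholder lines stand in for the real file.

```python
import io
from itertools import islice


def read_blocks(lines, size=4):
    """Yield lists of `size` consecutive lines; a shorter trailing block is dropped."""
    iterator = iter(lines)
    while True:
        # islice stops quietly at the end of the input instead of raising.
        block = list(islice(iterator, size))
        if len(block) < size:
            return
        yield block


sample = io.StringIO("".join("line %d\n" % i for i in range(1, 11)))
blocks = list(read_blocks(sample))
print(len(blocks))
```

Here the two leftover lines at the end are silently discarded; whether that or an explicit error is preferable depends on how much the input is trusted.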

answered 2015-12-23T11:13:01.940

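You have 10 lines, so the loop can iterate 2 times and is then 2 lines short of a full block. That is the point where Python cannot read enough lines and raises StopIteration.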

Check out this code; I have updated it slightly:

import sys
import collections
from collections import Counter

filename1 = sys.argv[1] # file to process

with open(filename1,'r') as input_file:
    while True:
        try:
            line1, line2, line3, line4 = [next(input_file) for line in xrange(4)]
        except StopIteration:
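            # Also reached at a clean end of file; written to stderr so it
            # does not end up in the redirected output.
            sys.stderr.write("No complete 4-line block left to read\n")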
            break

        c = Counter(line2).values() # count characters in line2
        c.sort(reverse=True) # sort values in descending order
        if c[0] < 36:
            print line1 + line2 + line3 + line4.rstrip()
        else:
            sys.stderr.write("Skipping block: line 2 is a single repeated character\n")
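The Counter-based test works because a homogeneous line has one character with a count of 36, but it depends on the fixed length of 36 and also counts the trailing newline as a character. A possible alternative, shown here only as a sketch (is_homogeneous is a hypothetical helper, not part of the code above), is to compare the set of distinct characters after stripping the newline:

```python
def is_homogeneous(sequence_line):
    """Return True if the line, without its newline, is one character repeated."""
    sequence = sequence_line.rstrip("\n")
    # An empty line is not treated as homogeneous.
    return len(sequence) > 0 and len(set(sequence)) == 1


print(is_homogeneous("A" * 36 + "\n"))
print(is_homogeneous("AGCAGCATTGTACAGGGCTATCATGGAATTCTCGGG\n"))
```

This check does not depend on the line length, so it keeps working if reads of a different length appear in the input.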
answered 2015-12-23T11:22:34.980