I have some data in the following format:

data = """

[Data-0]
Data = BATCH
BatProtocol = DIAG-ST
BatCreate = 20010724

[Data-1]
Data = SAMP
SampNum = 357
SampLane = 1

[Data-2]
Data = SAMP
SampNum = 357
SampLane = 2

[Data-9]
Data = BATCH
BatProtocol = VCA
BatCreate = 20010725

[Data-10]
Data = SAMP
SampNum = 359
SampLane = 1

[Data-11]
Data = SAMP
SampNum = 359
SampLane = 2

"""

The structure is:

  1. [Data-x], where x is a number
  2. Data = followed by either BATCH or SAMP
  3. A few more lines

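I am trying to write a function that yields one list per "batch". The first item of each list is the block of text containing the line Data = BATCH, and the following items are the blocks of text containing the line Data = SAMP. I currently have: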

def get_batches(data):
    textblocks = iter([txt for txt in data.split('\n\n') if txt.strip()])
    batch = []
    sample = next(textblocks)
    while True:
        if 'BATCH' in sample:
            batch.append(sample)
        sample = next(textblocks)
        if 'BATCH' in sample:
            yield batch
            batch = []
        else:
            batch.append(sample)

When called like this:

batches = get_batches(data)
for batch in batches:
    print batch
    print '_' * 20

However, it only returns the first "batch":

['[Data-0]\nData = BATCH\nBatProtocol = DIAG-ST\nBatCreate = 20010724', 
 '[Data-1]\nData = SAMP\nSampNum = 357\nSampLane = 1', 
 '[Data-2]\nData = SAMP\nSampNum = 357\nSampLane = 2']
____________________

Whereas my expected output would be:

['[Data-0]\nData = BATCH\nBatProtocol = DIAG-ST\nBatCreate = 20010724', 
 '[Data-1]\nData = SAMP\nSampNum = 357\nSampLane = 1', 
 '[Data-2]\nData = SAMP\nSampNum = 357\nSampLane = 2']
____________________
['[Data-9]\nData = BATCH\nBatProtocol = VCA\nBatCreate = 20010725', 
 '[Data-10]\nData = SAMP\nSampNum = 359\nSampLane = 1', 
 '[Data-11]\nData = SAMP\nSampNum = 359\nSampLane = 2']
____________________

What am I missing, or how could I improve my function?

2 Answers

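You only yield a batch when you find the start of the next one, so you never include the last batch of data. To fix this, you will need something like the following at the end of the function: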

if batch:
    yield batch

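However, doing just that will not work. Eventually the next(textblocks) inside the loop raises StopIteration, so any code after the while loop never gets to execute. Here is one way to make it work with only minor changes to your current code (see below for a better version):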

def get_batches(data):
    textblocks = iter([txt for txt in data.split('\n\n') if txt.strip()])
    batch = []
    sample = next(textblocks)
    while True:
        if 'BATCH' in sample:
            batch.append(sample)
        try:
            sample = next(textblocks)
        except StopIteration:
            break
        if 'BATCH' in sample:
            yield batch
            batch = []
        else:
            batch.append(sample)
    if batch:
        yield batch

I would suggest simply looping over textblocks with a for loop:

def get_batches(data):
    textblocks = (txt for txt in data.split('\n\n') if txt.strip())
    batch = []
    for sample in textblocks:
        if 'BATCH' in sample:
            if batch:
                yield batch
            batch = []
        batch.append(sample)
    if batch:
        yield batch
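
As a quick check, the for-loop version can be run against a shortened copy of the sample data. This is a minimal sketch assuming Python 3 (so print is a function); sample_data and the header-only printing are illustrative additions, not part of the question.

```python
def get_batches(data):
    # Split on blank lines and skip chunks that are empty or whitespace only.
    textblocks = (txt for txt in data.split('\n\n') if txt.strip())
    batch = []
    for sample in textblocks:
        if 'BATCH' in sample:
            # A new batch starts: emit the previous one, if there is one.
            if batch:
                yield batch
            batch = []
        batch.append(sample)
    # Emit the final batch, which has no following BATCH block to trigger it.
    if batch:
        yield batch


# Shortened, hypothetical copy of the data from the question.
sample_data = """
[Data-0]
Data = BATCH
BatProtocol = DIAG-ST

[Data-1]
Data = SAMP
SampNum = 357

[Data-9]
Data = BATCH
BatProtocol = VCA

[Data-10]
Data = SAMP
SampNum = 359
"""

batches = list(get_batches(sample_data))
for batch in batches:
    # Print only the [Data-x] header of each block to keep the output short.
    print([block.strip().splitlines()[0] for block in batch])
```

Both batches come out, including the last one, because the trailing if batch: yield batch runs once the for loop is exhausted.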

answered 2013-04-09T19:34:17.243

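As @FJ explained, the real problem with your code is that you never yield the last value. However, there are other improvements that can be made, some of which make the last-value problem easier to solve.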

What struck me most when I first looked at your code was the two if statements that check for 'BATCH' in sample; they can be merged into one.

Here is a version that does that, and that also uses a for loop over the generator rather than while True:

def get_batches(data):
    textblocks = (txt for txt in data.split('\n\n') if txt.strip())
    batch = [next(textblocks)]
    for sample in textblocks:
        if 'BATCH' in sample:
            yield batch
            batch = []
        batch.append(sample)
    yield batch

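I yield batch unconditionally at the end, because there is no situation in which you can get there with batch empty (if data is empty, the initialization of batch near the start will raise StopIteration).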
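
One caveat on the empty-input case: on Python 3.7 and later, a StopIteration that escapes a generator body is converted into a RuntimeError (PEP 479), so empty data would surface as an error instead of an empty result. The sketch below is one possible guard, assuming Python 3; the name get_batches_safe is hypothetical and the two-argument form of next() is swapped in for the bare next() call.

```python
def get_batches_safe(data):
    # Hypothetical variant of get_batches that tolerates empty input.
    textblocks = (txt for txt in data.split('\n\n') if txt.strip())
    # next() with a default returns None instead of raising StopIteration.
    first = next(textblocks, None)
    if first is None:
        return  # no blocks at all: the generator simply yields nothing
    batch = [first]
    for sample in textblocks:
        if 'BATCH' in sample:
            yield batch
            batch = []
        batch.append(sample)
    # batch always holds at least one block here, so no emptiness check is needed.
    yield batch


print(list(get_batches_safe("")))
print(list(get_batches_safe("A BATCH\n\nx\n\nB BATCH\n\ny")))
```

The unconditional final yield is kept, since the early return already covers the only case in which batch could be empty.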

answered 2013-04-09T19:47:11.480