python - How to iterate over a data file without code duplication?

Question

I want to write a script to process some data files. The data files are just ASCII text with columns of data; here is a simple example.

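The first column is an ID number, in this case from 1 to 3. The second column is the value of interest. (The actual files I'm using have many more IDs and values, but let's keep it simple here.)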

Contents of data.txt:

1 5
1 4
1 10
1 19
2 15
2 18
2 20
2 21
3 50
3 52
3 55
3 70

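I want to iterate over the data and extract the values for each ID, then process them, i.e. get all the values for ID 1 and do something with them, then get all the values for ID 2, and so on.

So I can write this in Python.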

#!/usr/bin/env python

def processValues(values):
    print("Will do something with data here:", values)

with open('data.txt', 'r') as f:
    datalines = f.readlines()

currentID = None
first = True

for line in datalines:
    fields = line.split()

    # if we've moved onto a new ID,
    # then process the values we've collected so far
    if fields[0] != currentID:

        # but if this is our first iteration, then
        # we just need to initialise our ID variable
        if not first:
            processValues(values)  # do something useful

        currentID = fields[0]
        values = []
        first = False

    values.append(fields[1])

processValues(values)  # do something with the last values

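The problem I have is that processValues() must be called again at the end. This requires duplicated code, which means that one day I might write a script like this, forget to put the extra processValues() at the end, and therefore miss the last ID. It also requires storing whether this is our "first" iteration, which is annoying.

Is there any way to do this without two calls to processValues() (one inside the loop for each new ID, and one after the loop for the last ID)?

The only way I can think of is to store the line number and check inside the loop whether we are on the last line. But that seems to defeat the point of "foreach"-style processing, where we store the line itself rather than an index or the total number of lines. This also applies to other scripting languages such as Perl, where iterating over lines with while(<FILE>) is common and you have no idea how many lines remain. Is it always necessary to write the function call again at the end?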

score 3 · Accepted Answer

If all occurrences of a key are contiguous, you want to look at itertools.groupby. A basic example:

from itertools import groupby
from operator import itemgetter

with open('somefile.txt') as fin:
    # split each line into its whitespace-separated fields
    lines = (line.split() for line in fin)
    # group consecutive lines that share the same first field
    for key, values in groupby(lines, itemgetter(0)):
        print('Key', key, 'has values')
        for value in values:
            print(value)
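Applied to the question's script, the same idea might look like the sketch below. It assumes the IDs are contiguous, as stated above. The names process_values and process_by_id are hypothetical stand-ins (process_values plays the role of the question's processValues()), and the sample lines are made up for illustration.

```python
from itertools import groupby


def process_values(values):
    # Hypothetical stand-in for the question's processValues();
    # it just returns the values so the result can be inspected.
    return list(values)


def process_by_id(lines):
    """Call process_values once per run of consecutive lines sharing an ID."""
    results = []
    # Split each non-blank line into whitespace-separated fields.
    rows = (line.split() for line in lines if line.strip())
    for key, group in groupby(rows, key=lambda fields: fields[0]):
        values = [fields[1] for fields in group]
        results.append((key, process_values(values)))
    return results


# Illustrative sample lines; a file object opened on data.txt works the same way.
sample = ["1 5", "1 4", "2 15", "2 18", "3 50"]
for key, values in process_by_id(sample):
    print(key, values)
```

There is only one call to process_values, and neither a "first iteration" flag nor a trailing call after the loop is needed, because groupby itself ends the last group when the input runs out.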

Alternatively, you could also look at using collections.defaultdict with a list as the default value.
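A minimal sketch of the defaultdict variant, under the assumption that all values fit in memory. The function name group_values and the sample lines are hypothetical. Unlike groupby, this does not need the IDs to be contiguous.

```python
from collections import defaultdict


def group_values(lines):
    """Map each ID to the list of its values, in the order they appear."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # A missing key is created with an empty list on first access.
        groups[fields[0]].append(fields[1])
    return groups


# Illustrative sample with interleaved IDs.
sample = ["1 5", "2 15", "1 4", "3 50", "2 18"]
for key, values in group_values(sample).items():
    print(key, values)
```

The processing call then sits in a single place, inside the loop over the dictionary's items, at the cost of reading the whole file before processing starts.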

score 1 · Accepted Answer

With numpy's loadtxt() it could look like this:

from numpy import loadtxt, unique

data = loadtxt("data.txt")
ids = unique(data[:, 0]).astype(int)

for id in ids:
    d = data[data[:, 0] == id]
    # d is a reduced array (matrix) containing the rows for <id>
    # .......
    # do some stuff with d

For your example, printing id and d inside the loop gives:

id= 1 
d=
[[  1.   5.]
 [  1.   4.]
 [  1.  10.]
 [  1.  19.]]
id= 2 
d=
[[  2.  15.]
 [  2.  18.]
 [  2.  20.]
 [  2.  21.]]
id= 3 
d=
[[  3.  50.]
 [  3.  52.]
 [  3.  55.]
 [  3.  70.]]
