
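I want to process a text file line by line. An (initially unknown) number of consecutive lines belong to the same entity, i.e. they carry the same identifier. For example: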

line1: stuff, stuff2, stuff3, ID1, stuff4, stuff5
line2: stuff, stuff2, stuff3, ID1, stuff4, stuff5    
line3: stuff, stuff2, stuff3, ID1, stuff4, stuff5
line4: stuff, stuff2, stuff3, ID2, stuff4, stuff5
line5: stuff, stuff2, stuff3, ID2, stuff4, stuff5
...

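In this dummy example, lines 1-3 belong to entity ID1 and lines 4-5 to ID2. I want to read each of these lines as a dictionary and then nest them in a dictionary that holds all the dictionaries of one ID (e.g. a dictionary for ID1 with 3 nested dictionaries, one each for lines 1-3).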

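More specifically, I would like to define a function that:

  1. Opens the file
  2. Reads all (but only) the lines of entity ID1 into individual dictionaries
  3. Returns a dictionary containing the nested dictionaries of the ID1 lines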

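I would like to be able to call the function again later to read the next dictionary, with all lines of the following identifier (ID2), later ID3, and so on. One problem I ran into is that for each line I need to test whether the current line still carries the ID of interest or already a new one. If it is new, I can of course stop and return the dictionary, but in the next round (say, ID2) the first line of ID2 has then already been read, so I seem to lose that line.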

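In other words: once a line with a new ID is encountered, I would like to somehow reset the counter in the function so that in the next iteration the first line with the new ID is not lost.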

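This seems like a simple task, but I cannot come up with an elegant way to do it. I currently pass some "memory" flags/variables between functions to keep track of whether the first line of a new ID was already read in the previous iteration. This is rather bulky and error-prone.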

Thanks for reading. Any ideas/hints are highly appreciated. If something is unclear, please ask.

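Here is my "solution". It seems to work in the sense that it prints the dictionaries correctly (although I am sure there is a more elegant way to do this). I also forgot to mention that the text file is very large, so I want to process it ID by ID rather than reading the whole file into memory.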

with open(infile, "r") as f:
    newIDLine = None
    for line in f:
        if not line:
            break
        # the following function returns the ID
        ID = get_ID_from_line(line)
        counter = 1
        ID_Dic = dict()
        # if first line is completely new (i.e. first line in infile)
        if newIDLine is None:
            currID = ID
            # the following function returns the line as a dic
            ID_Dic[counter] = process_line(line)
        # if first line of new ID was already read in
        # the previous "while" iteration (see below).
        if newIDLine is not None:
            # if the current "line" has the same ID as the
            # previous one: put previous and current line in
            # the same dic and start the while loop.
            if ID == oldID:
                ID_Dic[counter] = process_line(newIDLine)
                counter += 1
                ID_Dic[counter] = process_line(line)
                currID = ID
        # iterate over the following lines until file end or
        # new ID starts. In the latter case: keep the info in
        # objects newIDline and oldID
        while True:
            # next(f, None) returns None at end of file instead of
            # raising StopIteration
            newLine = next(f, None)
            if not newLine:
                break
            ID = get_ID_from_line(newLine)
            if ID == currID:
                counter += 1
                ID_Dic[counter] = process_line(newLine)
            # new ID; save line for the upcoming ID dic
            if not ID == currID:
                newIDLine = newLine
                oldID = ID
                break
    # at this point it would be great to return the Dictionary of
    # the current ID to the calling function but at return to this
    # function continue where I left off.
    print(ID_Dic)

2 Answers


You can use a dictionary to keep track of all the IDs: just append each line to the list stored under its ID column, for example:

from collections import defaultdict
import csv

all_lines_dict = defaultdict(list)

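with open('your_file', newline='') as f:
    # skipinitialspace drops the blank after each comma,
    # so the key is 'ID1' rather than ' ID1'
    csv_reader = csv.reader(f, skipinitialspace=True)
    for line_list in csv_reader:
        # column 3 holds the ID in the sample lines
        all_lines_dict[line_list[3]].append(line_list)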

The csv reader is part of the Python standard library and makes reading CSV files easy. It reads each line as a list of its columns.

This differs from what you asked for: each key maps to a list of the rows that share that ID, not to a dictionary of dictionaries.
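
If the nested shape from the question is needed, the same loop can store each row under a per-ID counter. This is only a minimal sketch, assuming the ID sits in column 3 as in the sample lines; `group_rows` and its `id_column` parameter are hypothetical names, and like the code above it still keeps every row in memory:

```python
import csv
import io
from collections import defaultdict


def group_rows(lines, id_column=3):
    """Return {id: {1: row, 2: row, ...}} for an iterable of CSV lines."""
    nested = defaultdict(dict)
    for row in csv.reader(lines, skipinitialspace=True):
        rows_for_id = nested[row[id_column]]
        # number the rows of each ID starting at 1
        rows_for_id[len(rows_for_id) + 1] = row
    return dict(nested)


# in-memory stand-in for the real file (made-up sample data)
sample = io.StringIO(
    "a, b, c, ID1, d, e\n"
    "a, b, c, ID1, d, e\n"
    "a, b, c, ID2, d, e\n"
)
grouped = group_rows(sample)
```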

answered 2013-07-15T13:33:14.533

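If you want this function to lazily return one dict per ID, make it a generator function by using yield instead of return. At the end of each ID, yield the dict for that ID. You can then iterate over that generator.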
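
A minimal sketch of such a generator, assuming comma-separated lines with the ID in column 3 as in the sample. `dicts_by_id`, `get_id` and `process` are hypothetical stand-ins for the question's helpers. Because a generator keeps its local variables between calls, the first line of the next ID is simply handled when iteration resumes and is not lost:

```python
def get_id(line, id_column=3):
    # hypothetical stand-in for get_ID_from_line
    return line.split(",")[id_column].strip()


def process(line):
    # hypothetical stand-in for process_line: map field index to field
    return dict(enumerate(field.strip() for field in line.split(",")))


def dicts_by_id(lines):
    """Yield (id, {1: line_dict, 2: line_dict, ...}) per run of equal IDs."""
    current_id = None
    id_dict = {}
    for line in lines:
        line_id = get_id(line)
        if id_dict and line_id != current_id:
            # a new ID starts: hand out the finished dict first
            yield current_id, id_dict
            id_dict = {}
        current_id = line_id
        id_dict[len(id_dict) + 1] = process(line)
    if id_dict:
        # the last ID has no following line to trigger the yield
        yield current_id, id_dict


# made-up sample data standing in for the lines of the file
sample = [
    "a, b, c, ID1, d, e\n",
    "a, b, c, ID1, d, e\n",
    "a, b, c, ID2, d, e\n",
]
result = list(dicts_by_id(sample))
```

With a real file, looping over `dicts_by_id(f)` inside a `with open(infile) as f:` block holds only the lines of one ID in memory at a time.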

To handle the file, write a generator function that iterates over a source unless you send it a value, in which case it returns that value next, then goes back to iterating. (For example, here's a module I wrote to do this for myself: politer.py.)
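
The linked module is not reproduced here. Purely as an illustration of the idea, a small wrapper class can provide comparable push-back behaviour; `PushbackIterator` is a hypothetical name and its interface is an assumption, not the actual contents of politer.py:

```python
class PushbackIterator:
    """Iterate over a source, but return values given to send() first."""

    def __init__(self, source):
        self._source = iter(source)
        self._pending = []

    def __iter__(self):
        return self

    def __next__(self):
        if self._pending:
            # a value was sent back: return it before reading further
            return self._pending.pop()
        return next(self._source)

    def send(self, value):
        # store the value so that the next call to next() returns it again
        self._pending.append(value)


it = PushbackIterator(["a", "b", "c"])
first = next(it)
second = next(it)
it.send(second)  # "un-read" the second item
rest = list(it)
```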

Then you can solve this problem easily by sending the value "back" if you don't want it:

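def read_by_id(infile):
    # politer is the helper from politer.py mentioned above; it is assumed
    # to wrap an iterator and to hand back any value passed to .send()
    with open(infile, 'r') as f:
        polite_f = politer(f)
        current_id = None
        while True:
            id_dict = {}
            for i, line in enumerate(polite_f):
                line_id = get_id_from_line(line)
                if current_id is None:
                    current_id = line_id  # very first line of the file
                if line_id != current_id:
                    polite_f.send(line)   # push the line back for the next round
                    break
                id_dict[i] = process_line(line)
            else:
                # no break: the file is exhausted, so stop after the last dict
                if id_dict:
                    yield id_dict
                return
            yield id_dict
            current_id = line_id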

Note that this keeps the state handling abstracted in the generator where it belongs.

answered 2013-07-15T15:04:32.090