python - Read an initially unknown number N of lines from a file into a nested dictionary and start the next iteration at line N+1

Question

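I want to process a text file line by line. An (initially unknown) number of consecutive lines belong to the same entity (i.e., they carry the same identifier). For example: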

line1: stuff, stuff2, stuff3, ID1, stuff4, stuff5
line2: stuff, stuff2, stuff3, ID1, stuff4, stuff5    
line3: stuff, stuff2, stuff3, ID1, stuff4, stuff5
line4: stuff, stuff2, stuff3, ID2, stuff4, stuff5
line5: stuff, stuff2, stuff3, ID2, stuff4, stuff5
...

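In this dummy example, lines 1-3 belong to entity ID1 and lines 4-5 belong to ID2. I want to read each of these lines as a dictionary and then nest them in a dictionary that holds all the dictionaries for a given ID (e.g., a dictionary ID1 with 3 nested dictionaries, one for each of lines 1-3).

More specifically, I would like to define a function that:

opens the file
reads all (but only) the lines of entity ID1 into individual dictionaries
returns a dictionary containing the nested dictionaries for the lines of ID1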

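I would like to be able to call the function again later to read the next dictionary, holding all lines of the following identifier (ID2), then ID3, and so on. One problem I am running into is that I need to test on each line whether the current line still carries the ID of interest or already belongs to a new one. If it is new, I can of course stop and return the dictionary, but in the next round (say, for ID2) the first line of ID2 has already been read, so I seem to lose that line.

In other words: once a line with a new ID is encountered, I would like to somehow reset the counter in the function so that the first line with the new ID is not lost in the next iteration.

This seems like a straightforward task, but I cannot figure out an elegant way to do it. I currently pass some "memory" flags/variables between functions to keep track of whether the first line of a new ID was already read in the previous iteration. This is quite bulky and error-prone.

Thanks for reading... any ideas/hints are highly appreciated. Please ask if anything is unclear.

Here is my "solution". It seems to work in the sense that it prints the dictionaries correctly (although I am sure there is a more elegant way to do this). I also forgot to mention that the text file is very large, so I want to process it ID by ID rather than reading the whole file into memory.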

with open(infile, "r") as f:
    newIDLine = None
    for line in f:
        if not line:
            break
        # the following function returns the ID
        ID = get_ID_from_line(line)
        counter = 1
        ID_Dic = dict()
        # if first line is completely new (i.e. first line in infile)
        if newIDLine is None:
            currID = ID
            # the following function returns the line as a dic
            ID_Dic[counter] = process_line(line)
        # if first line of new ID was already read in
        # the previous "while" iteration (see below).
        if newIDLine is not None:
            # if the current "line" has the same ID as the
            # previous one: put previous and current line in
            # the same dic and start the while loop.
            if ID == oldID:
                ID_Dic[counter] = process_line(newIDLine)
                counter += 1
                ID_Dic[counter] = process_line(line)
                currID = ID
        # iterate over the following lines until file end or
        # new ID starts. In the latter case: keep the info in
        # objects newIDline and oldID
        while True:
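            # next(f, None) returns None at end of file instead of
            # raising StopIteration, so the check below can break.
            newLine = next(f, None)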
            if not newLine:
                break
            ID = get_ID_from_line(newLine)
            if ID == currID:
                counter += 1
                ID_Dic[counter] = process_line(newLine)
            # new ID; save line for the upcoming ID dic
            if not ID == currID:
                newIDLine = newLine
                oldID = ID
                break
    # at this point it would be great to return the Dictionary of
    # the current ID to the calling function but at return to this
    # function continue where I left off.
    print(ID_Dic)

score 1 · Accepted Answer

You can use a dictionary to keep track of all the IDX values: simply append each line to the list stored under its IDX column, for example:

from collections import defaultdict
import csv

all_lines_dict = defaultdict(list)

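# newline='' is what the csv module documentation recommends when opening files
with open('your_file', newline='') as f:
    csv_reader = csv.reader(f)
    for line_list in csv_reader:
        # column index 3 holds the ID; group whole rows under it
        all_lines_dict[line_list[3]].append(line_list)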

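The csv reader is part of Python's standard library and makes reading CSV files easy. It reads each line as a list of its columns.

This differs from what you asked for, because each key maps not to a dictionary of dictionaries but to a list of the lines that share that IDX key.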
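If the nested-dictionary shape from the question is still wanted, the grouped lists can be converted afterwards. The following is only a minimal sketch: the in-memory sample, the 1-based counter keys, and the names `sample`, `rows_by_id`, and `nested` are assumptions made for illustration, and `skipinitialspace=True` is used because the sample puts a blank after each comma.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample standing in for the real file; the ID is in column index 3.
sample = io.StringIO(
    "a, b, c, ID1, d, e\n"
    "a, b, c, ID1, d, e\n"
    "a, b, c, ID2, d, e\n"
)

rows_by_id = defaultdict(list)
# skipinitialspace=True drops the blank that follows each comma in the sample.
for row in csv.reader(sample, skipinitialspace=True):
    rows_by_id[row[3]].append(row)

# Turn each list of rows into a dict keyed by a 1-based counter.
nested = {
    entity_id: dict(enumerate(rows, start=1))
    for entity_id, rows in rows_by_id.items()
}
```

Note that this approach still holds every row in memory at once, so it may not suit the very large file mentioned in the question.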

score 1 · Accepted Answer

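If you want this function to lazily return a dict for each id, you should make it a generator function by using yield instead of return. At the end of each id, yield the dict for that id. Then you can iterate over that generator.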

To handle the file, write a generator function that iterates over a source unless you send it a value, in which case it returns that value next, then goes back to iterating. (For example, here's a module I wrote to do this for myself: politer.py.)
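A generator of the kind just described might look roughly like the sketch below. This is not the contents of politer.py, which is not shown here; the name `pushback` is made up for illustration, and the sketch assumes that `None` is never a real item (true for lines of a text file).

```python
def pushback(source):
    """Yield items from source; a value passed to send() is yielded again next."""
    for item in source:
        pushed = yield item
        while pushed is not None:
            # This bare yield produces the None that send() returns to its caller.
            yield
            # The following next() call receives the pushed-back value.
            pushed = yield pushed


lines = pushback(iter(["a", "b", "c"]))
first = next(lines)   # consume the first item
lines.send(first)     # hand it back to the generator
seen = list(lines)    # the handed-back item is delivered again, then the rest
```

The extra bare `yield` is needed because `send()` itself resumes the generator and returns whatever it yields next; without it, the pushed-back value would be swallowed as the return value of `send()`.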

Then you can solve this problem easily by sending the value "back" if you don't want it:

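def dicts_by_id(infile):
    # politer is the wrapper from the module mentioned above;
    # get_id_from_line and process_line are the question's helpers.
    with open(infile, 'r') as f:
        polite_f = politer(f)
        current_id = None
        while True:
            id_dict = {}
            exhausted = True
            for i, line in enumerate(polite_f):
                line_id = get_id_from_line(line)
                if line_id != current_id:
                    # first line of the next id: hand it back for the next round
                    polite_f.send(line)
                    exhausted = False
                    break
                id_dict[i] = process_line(line)
            if id_dict:
                yield id_dict
            if exhausted:
                # the file ran out without a new id appearing, so stop
                return
            current_id = line_id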
Note that this keeps the state handling abstracted in the generator where it belongs.
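For comparison, the lazy one-dict-per-id generator described in this answer can also be sketched with the standard library's `itertools.groupby`, which groups consecutive items sharing a key and so avoids the lost-first-line problem without sending values back. This is a different technique from the send-based wrapper above. The bodies of `get_id_from_line` and `process_line` are stand-ins (the question does not show them), and the in-memory sample replaces the real file.

```python
import io
from itertools import groupby


def get_id_from_line(line):
    # Assumption: comma-separated fields with the ID in column index 3.
    return line.split(",")[3].strip()


def process_line(line):
    # Placeholder: map a 0-based column index to each stripped field.
    return dict(enumerate(field.strip() for field in line.split(",")))


def dicts_by_id(lines):
    """Lazily yield (ID, {counter: line_dict}) for each run of same-ID lines."""
    for entity_id, group in groupby(lines, key=get_id_from_line):
        yield entity_id, {
            n: process_line(line) for n, line in enumerate(group, start=1)
        }


# Hypothetical sample; with a real file, pass the open file object instead.
sample = io.StringIO(
    "a, b, c, ID1, d, e\n"
    "a, b, c, ID1, d, e\n"
    "a, b, c, ID2, d, e\n"
)
result = list(dicts_by_id(sample))
```

Because `groupby` reads from the file iterator only as groups are consumed, at most one id's lines are held in memory at a time when the generator is iterated one item at a time rather than collected with `list`.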
