python - Lazily parsing a stateful, multi-line-per-record data stream in Python?

Question

Here is what one file looks like:

BEGIN_META
    stuff
    to
    discard
END_META
BEGIN_DB
    header
    to
    discard

    data I
    wish to
    extract
 END_DB

I'd like to be able to parse an endless stream of them, all cat-ed together, which rules out doing something like re.findall('something useful', '\n'.join(sys.stdin), re.M).

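Below is my attempt, but I have to force the generator returned by get_raw_table() into a list, so it doesn't quite meet the requirement. Removing the forcing means you can't test whether the returned generator is empty, so you can't tell whether sys.stdin is exhausted.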

import sys

def get_raw_table(it):
    state = 'begin'
    for line in it:
        if line.startswith('BEGIN_DB'):
            state = 'discard'
        elif line.startswith('END_DB'):
            return
        # Compare strings with ==; "is" tests identity and only works by accident.
        elif state == 'discard' and not line.strip():
            state = 'take'
        elif state == 'take' and line:
            yield line.strip().strip('#').split()

# raw_tables is a list (per file) of lists (per row) of lists (per column)
raw_tables = []
while True:
    result = list(get_raw_table(sys.stdin))
    if result:
        raw_tables.append(result)
    else:
        break
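
To make the emptiness problem described above concrete, here is a minimal sketch (empty_table and peek are hypothetical names, not part of the attempt above). A generator object is always truthy, so the only way to learn whether it will produce anything is to pull an item from it. Pulling one item and chaining it back in front is one possible way to check without forcing the whole table into a list.

```python
import itertools

def empty_table():
    # A generator that yields nothing (hypothetical stand-in for an exhausted stream).
    return
    yield

def peek(gen):
    """Return (is_empty, iterator) after pulling at most one item from gen."""
    sentinel = object()
    first = next(gen, sentinel)
    if first is sentinel:
        return True, iter(())
    # Put the pulled item back in front of the remaining items.
    return False, itertools.chain([first], gen)

# A generator object is truthy even when it will yield nothing.
print(bool(empty_table()))

is_empty, rows = peek(iter([["a", "b"], ["c"]]))
print(is_empty, list(rows))
```

Note that with get_raw_table() an empty result is still ambiguous: it can mean either an empty table or the end of the input.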

score 4 · Accepted Answer

Something like this might work:

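import itertools

def chunks(it):
    it = iter(it)
    while True:
        # Skip ahead to BEGIN_DB, then past the header up to the first blank line.
        head = itertools.dropwhile(lambda x: 'BEGIN_DB' not in x, it)
        head = itertools.dropwhile(lambda x: x.strip(), head)
        try:
            next(head)  # consume the blank line that ends the header
        except StopIteration:
            return  # input exhausted; a bare next() here raises RuntimeError on Python 3.7+
        # Lazily yield data lines up to END_DB; consume each chunk before requesting the next.
        yield itertools.takewhile(lambda x: 'END_DB' not in x, it)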

For example:

src = """
BEGIN_META
    stuff
    to
    discard
END_META
BEGIN_DB
    header
    to
    discard

    1data I
    1wish to
    1extract
 END_DB


BEGIN_META
    stuff
    to
    discard
END_META
BEGIN_DB
    header
    to
    discard

    2data I
    2wish to
    2extract
 END_DB
"""


src = iter(src.splitlines())
for chunk in chunks(src):
    for line in chunk:
        print(line.strip())
    print()
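
To tie this back to the raw_tables structure in the question (a list of rows, each a list of columns), the chunks can be split into columns as they arrive. This is only a sketch assuming Python 3: chunks is restated here so the example runs on its own, tables is a hypothetical helper name, and the sample input is a shortened version of the one above.

```python
import itertools

def chunks(lines):
    # Python 3 variant of the generator above: stop cleanly when input runs out.
    lines = iter(lines)
    while True:
        head = itertools.dropwhile(lambda x: 'BEGIN_DB' not in x, lines)
        head = itertools.dropwhile(lambda x: x.strip(), head)
        if next(head, None) is None:  # no more BEGIN_DB sections
            return
        yield itertools.takewhile(lambda x: 'END_DB' not in x, lines)

def tables(lines):
    # Hypothetical helper: one list of rows per DB section, each row split into columns.
    for chunk in chunks(lines):
        yield [line.split() for line in chunk if line.strip()]

sample = """\
BEGIN_META
    stuff
END_META
BEGIN_DB
    header

    1data I
    1wish to
 END_DB
BEGIN_DB
    header

    2data I
 END_DB
""".splitlines()

for table in tables(sample):
    print(table)
```

Each table is built only when the loop asks for it, so the same loop should also work over sys.stdin without reading the whole stream first.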

score 1 · Accepted Answer

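You could split the work into separate functions, so that the program's logic is easier to follow and the code is more modular and flexible. Try to avoid writing things like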

state = "some string"

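because if you want to add something to this module later, you will need to know which values your "state" variable can take and what happens when its value changes. You are not guaranteed to remember that, and it will cause you trouble. Writing functions that mimic this behavior is cleaner and easier to implement.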

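import sys

def read_stdin():
    # Yield lines from standard input one at a time.
    for line in sys.stdin:
        yield line

def search_line_for_start_db(line):
    # Start collecting when a BEGIN_DB marker is seen.
    if "BEGIN_DB" in line:
        search_db_for_info()

def search_db_for_info():
    # Read up to END_DB, keeping every non-blank line.
    # Note: this keeps the header lines too; skipping them is not handled here.
    new_line = next(read_line)
    while "END_DB" not in new_line:
        if new_line.strip():
            # Put your information somewhere
            raw_tables.append(new_line)
        new_line = next(read_line)

read_line = read_stdin()
raw_tables = []
while True:
    try:
        search_line_for_start_db(next(read_line))
    except StopIteration:  # Your stdin stream has finished being read
        break  # end your program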
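
If a single generator is preferred, another way to address the concern about string-valued state is to name the states explicitly with an Enum. The following is a minimal sketch, not the approach of either answer: State and iter_tables are hypothetical names, and it assumes the header ends at the first blank line, as in the sample file.

```python
import enum

class State(enum.Enum):
    OUTSIDE = enum.auto()  # not inside a DB section
    HEADER = enum.auto()   # inside a DB section, before the blank line
    DATA = enum.auto()     # after the blank line, collecting rows

def iter_tables(lines):
    """Yield one list of rows (each a list of columns) per BEGIN_DB/END_DB section."""
    state = State.OUTSIDE
    rows = []
    for line in lines:
        text = line.strip()
        if text.startswith('BEGIN_DB'):
            state, rows = State.HEADER, []
        elif text.startswith('END_DB'):
            if state is not State.OUTSIDE:
                yield rows
            state = State.OUTSIDE
        elif state is State.HEADER and not text:
            state = State.DATA
        elif state is State.DATA and text:
            rows.append(text.split())

demo = ["BEGIN_DB", "  header", "", "  a b", "  c", " END_DB"]
print(list(iter_tables(demo)))
```

Because a table is yielded as soon as its END_DB line is read, an empty table (an empty list) can be told apart from the end of the input (the generator simply finishes).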
