python - Slicing data chunk by chunk with Python

Question

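Hi all, I have a large file in the format shown below. The data is organized in "chunks". Each chunk contains three lines: the time T, the user U, and the content W. For example, this is one chunk: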

T   2009-06-11 21:57:23
U   tracygazzard
W   David Letterman is good man

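I only need the chunks that contain a particular keyword, so I want to slice the data out of the huge original file chunk by chunk rather than loading the whole file into memory. Each time, read in one chunk; if its content line contains the word "bike", write that chunk to disk.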

You can use the following two chunks to test your script.

T   2009-06-11 21:57:23
U   tracygazzard
W   David Letterman is good man

T   2009-06-11 21:57:23
U   charilie
W   i want a bike

I tried to do this line by line:

data = open("OWS.txt", 'r')
output = open("result.txt", 'w')

for line in data:
    if line.find("bike") != -1:
        output.write(line)   # writes only the matching line, not the whole chunk

score 1 · Accepted Answer

Since the chunk format is constant, you can use a list to hold the current chunk and then check whether bike appears in that chunk:

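chunk = []
# The with statement closes both files even if an error occurs
with open("OWS.txt", 'r') as data, open("result.txt", 'w') as output:
    for line in data:
        chunk.append(line)
        if line[0] == 'W':                        # the W line ends a chunk
            if any('bike' in item for item in chunk):
                output.writelines(chunk)          # write the whole chunk
            chunk = []                            # start collecting the next chunk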
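The same idea can also be sketched as a generator, so that the grouping step and the keyword test stay separate and only one chunk is held in memory at a time. This is a minimal sketch, not part of the original answer: the names `iter_chunks` and `chunks_with_keyword` are hypothetical, it assumes every chunk ends with its W line, and it tests the keyword against the W line only. An in-memory `io.StringIO` stands in for the real file.

```python
import io

def iter_chunks(lines):
    """Yield one list of lines per T/U/W chunk (a chunk ends at its W line)."""
    chunk = []
    for line in lines:
        if not line.strip():
            continue  # skip blank separator lines between chunks
        chunk.append(line)
        if line.startswith("W"):
            yield chunk
            chunk = []

def chunks_with_keyword(lines, keyword):
    """Yield only the chunks whose W line contains the keyword."""
    for chunk in iter_chunks(lines):
        if keyword in chunk[-1]:
            yield chunk

# The two test chunks from the question, used in place of OWS.txt
sample = io.StringIO(
    "T   2009-06-11 21:57:23\n"
    "U   tracygazzard\n"
    "W   David Letterman is good man\n"
    "\n"
    "T   2009-06-11 21:57:23\n"
    "U   charilie\n"
    "W   i want a bike\n"
)

matches = list(chunks_with_keyword(sample, "bike"))
for chunk in matches:
    print("".join(chunk), end="")
```

With a real file, passing the open file object instead of `sample` should behave the same way, since iterating over a file yields one line at a time.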

score 1 · Accepted Answer

You can use a regular expression (note that this version reads the entire file into memory):

import re

with open("OWS.txt", 'r') as f:
    data = f.read()   # Read the entire file into a string

with open("result.txt", 'w') as output:
    for match in re.finditer(
        r"""(?mx)          # Verbose regex, ^ matches start of line
        ^T\s+(?P<T>.*)\s*  # Match first line
        ^U\s+(?P<U>.*)\s*  # Match second line
        ^W\s+(?P<W>.*)\s*  # Match third line""",
        data):
        if "bike" in match.group("W"):
            output.write(match.group())  # outputs entire match
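To see what the named groups capture, the pattern can be tried on the two test chunks from the question. This is only an illustrative sketch: the name `BLOCK` is made up here, the flags are passed as `re.MULTILINE | re.VERBOSE` rather than inline, and the sample text is held in a string instead of being read from `OWS.txt`.

```python
import re

# Same pattern as above, compiled once with explicit flags
BLOCK = re.compile(
    r"""
    ^T\s+(?P<T>.*)\s*   # time line
    ^U\s+(?P<U>.*)\s*   # user line
    ^W\s+(?P<W>.*)\s*   # content line
    """,
    re.MULTILINE | re.VERBOSE,
)

sample = (
    "T   2009-06-11 21:57:23\n"
    "U   tracygazzard\n"
    "W   David Letterman is good man\n"
    "\n"
    "T   2009-06-11 21:57:23\n"
    "U   charilie\n"
    "W   i want a bike\n"
)

# Collect the user of every chunk whose content mentions "bike"
hits = [m.group("U") for m in BLOCK.finditer(sample) if "bike" in m.group("W")]
print(hits)
```

Because each field is a named group, the time, user, and content of a matching chunk can be read individually without splitting the lines by hand.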
