python - Parsing a multi-line data file with Python

Question

What is the best way to parse the following multi-line data file with Python?

Police Response: 11/6/2012 1:34:06 AM   Incident Desc: Traffic Stop OFC:    Received: 11/6/2012 1:34:06 AM
Disp: PCHK  Location: CLEAR LAKE RD&GREEN HILL RD
Event Number: LLS121106060941   ID: 60941   Priority: 6 Case No:
Police Response:    Incident Desc: Theft    OFC:    Received: 11/6/2012 1:43:35 AM
Disp: CSR   Location: SCH BLACHLY
Event Number: LLS121106060943   ID: 60943   Priority: 4 Case No:
Police Response: 11/6/2012 1:47:47 AM   Incident Desc: Suspicious Vehicle(s)    OFC:        Received: 11/6/2012 1:47:47 AM
Disp: FI    Location: KIRK RD&CLEAR LAKE RD
Event Number: LLS121106060944   ID: 60944   Priority: 6 Case No:

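Records always span 3 lines: each begins with the line starting with "Police Response" and ends with the line starting with "Event Number". Some fields are often blank.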

score 10 · Accepted Answer

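This should do the trick. I split your data into a list of cases, each holding the three lines of one record. Then I use a regular expression to split each case on the field names. After that, I put the resulting key/value pairs into a dictionary, so you can easily iterate over the cases and look up any field value by name. I print the contents of the rows only to show the data structure.

Code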

from pprint import pprint
from collections import OrderedDict
import re

data = """Police Response: 11/6/2012 1:34:06 AM   Incident Desc: Traffic Stop OFC:    Received: 11/6/2012 1:34:06 AM
Disp: PCHK  Location: CLEAR LAKE RD&GREEN HILL RD
Event Number: LLS121106060941   ID: 60941   Priority: 6 Case No:
Police Response:    Incident Desc: Theft    OFC:    Received: 11/6/2012 1:43:35 AM
Disp: CSR   Location: SCH BLACHLY
Event Number: LLS121106060943   ID: 60943   Priority: 4 Case No:
Police Response: 11/6/2012 1:47:47 AM   Incident Desc: Suspicious Vehicle(s)    OFC:        Received: 11/6/2012 1:47:47 AM
Disp: FI    Location: KIRK RD&CLEAR LAKE RD
Event Number: LLS121106060944   ID: 60944   Priority: 6 Case No: """

lines = data.splitlines()
cases = ['\n'.join(lines[i:i+3]) for i in range(0, len(lines), 3)]
pattern = '(Police Response|Incident Desc|OFC|Received|Disp|Location|Event Number|ID|Priority|Case No):'
rows = []
for case in cases:
    pairs =  re.split(pattern, case)[1:]
    rows.append(OrderedDict((pairs[i*2], pairs[i*2+1]) for i in range(10)))

for i, row in enumerate(rows):
    print('============== {} =============='.format(i))
    # list() keeps the output a plain list of tuples on Python 3 as well
    pprint(list(row.items()))

Output:

============== 0 ==============
[('Police Response', ' 11/6/2012 1:34:06 AM   '),
 ('Incident Desc', ' Traffic Stop '),
 ('OFC', '    '),
 ('Received', ' 11/6/2012 1:34:06 AM\n'),
 ('Disp', ' PCHK  '),
 ('Location', ' CLEAR LAKE RD&GREEN HILL RD\n'),
 ('Event Number', ' LLS121106060941   '),
 ('ID', ' 60941   '),
 ('Priority', ' 6 '),
 ('Case No', '')]
============== 1 ==============
[('Police Response', '    '),
 ('Incident Desc', ' Theft    '),
 ('OFC', '    '),
 ('Received', ' 11/6/2012 1:43:35 AM\n'),
 ('Disp', ' CSR   '),
 ('Location', ' SCH BLACHLY\n'),
 ('Event Number', ' LLS121106060943   '),
 ('ID', ' 60943   '),
 ('Priority', ' 4 '),
 ('Case No', '')]
============== 2 ==============
[('Police Response', ' 11/6/2012 1:47:47 AM   '),
 ('Incident Desc', ' Suspicious Vehicle(s)    '),
 ('OFC', '        '),
 ('Received', ' 11/6/2012 1:47:47 AM\n'),
 ('Disp', ' FI    '),
 ('Location', ' KIRK RD&CLEAR LAKE RD\n'),
 ('Event Number', ' LLS121106060944   '),
 ('ID', ' 60944   '),
 ('Priority', ' 6 '),
 ('Case No', ' ')]
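
As the output shows, each value keeps its surrounding whitespace and newlines. Below is a minimal sketch of the same split-on-field-names idea with the values stripped, written for Python 3. The names `FIELDS`, `PATTERN` and `parse_cases` are introduced here for illustration and are not part of the answer above.

```python
import re

FIELDS = ["Police Response", "Incident Desc", "OFC", "Received", "Disp",
          "Location", "Event Number", "ID", "Priority", "Case No"]
# The capturing group makes re.split keep the field names in its result.
PATTERN = re.compile("(" + "|".join(FIELDS) + "):")


def parse_cases(text, lines_per_case=3):
    """Return one dict per record, with whitespace stripped from each value."""
    lines = text.splitlines()
    rows = []
    for start in range(0, len(lines), lines_per_case):
        case = "\n".join(lines[start:start + lines_per_case])
        parts = PATTERN.split(case)[1:]  # drop the text before the first field
        # parts alternates: name, value, name, value, ...
        rows.append({name: value.strip()
                     for name, value in zip(parts[0::2], parts[1::2])})
    return rows
```

Plain dicts preserve insertion order from Python 3.7 on, so the fields still come back in file order without OrderedDict.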

score 3 · Accepted Answer

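The biggest question:

What separates the entries? If there are tabs between entries, this is easy: just split each line on tabs. If there are always at least two spaces, you can split on that. If there is sometimes only a single space, that complicates things.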
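
To make the delimiter question concrete, here is a small sketch that splits a line on a tab or on a run of two or more spaces. It assumes the data is padded the way the sample in the question is displayed; `split_entries` is a hypothetical helper name.

```python
import re


def split_entries(line):
    """Split a line on a tab or on a run of two or more spaces."""
    return [part for part in re.split(r"\t| {2,}", line.strip()) if part]


print(split_entries("Disp: PCHK  Location: CLEAR LAKE RD&GREEN HILL RD"))
print(split_entries("Event Number: LLS121106060941   ID: 60941   Priority: 6 Case No:"))
```

On the second line, "Priority: 6 Case No:" comes back as a single piece, because in the sample as shown only one space separates those two entries. That is the complication mentioned above.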

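Otherwise, it is easy to write a generator/function that hands out three lines at a time, which you can then feed into a function that parses those three lines. The "3 lines at a time" part of the problem is the easy part.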

def return_3(file):
    # next(file) works on Python 2.6+ and Python 3; file.next() is Python 2 only
    return [next(file) for _ in range(3)]
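
A generator variant is sketched below. It uses itertools.islice to keep yielding groups of three lines and stops cleanly when the input runs out. The name `three_at_a_time` is illustrative only.

```python
from itertools import islice


def three_at_a_time(lines, size=3):
    """Yield successive lists of `size` lines from any iterable of lines."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return  # input exhausted
        yield chunk
```

It can be driven with an open file, for example `for chunk in three_at_a_time(open("file.dat")): ...`. If the line count is not a multiple of three, the last chunk is shorter, so a parser may want to check its length.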

score 0 · Accepted Answer

This regular expression should work:

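import re

with open('file.dat') as f:
    data = f.read()

# One group per field; "." does not match a newline, so each (.*) stays on its own line.
re.findall("""Police Response:(.*)Incident Desc:(.*)OFC:(.*)Received:(.*)
Disp:(.*)Location:(.*)
Event Number:(.*)ID:(.*)Priority:(.*)Case No:(.*)""", data)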
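
re.findall returns one tuple of captured groups per record, so the values still need to be paired with their field names. The sketch below shows one way to do that. It assumes one capture group per field, including ID, and builds its own pattern so that it is self-contained; `LINES`, `FIELD_NAMES`, `RECORD_RE` and `to_dicts` are names made up for this example.

```python
import re

# One tuple of names per line of a record (layout taken from the question).
LINES = [("Police Response", "Incident Desc", "OFC", "Received"),
         ("Disp", "Location"),
         ("Event Number", "ID", "Priority", "Case No")]
FIELD_NAMES = [name for line in LINES for name in line]
# "Name:(.*)" per field; fields on one line are concatenated, lines joined by "\n".
RECORD_RE = re.compile("\n".join(
    "".join(re.escape(name) + ":(.*)" for name in line) for line in LINES))


def to_dicts(text):
    """Map each findall tuple to a dict of stripped values keyed by field name."""
    return [dict(zip(FIELD_NAMES, (value.strip() for value in groups)))
            for groups in RECORD_RE.findall(text)]
```

Records that do not match the three-line layout are silently skipped by findall, which may or may not be what you want.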

score 0 · Accepted Answer

Assuming the input data format is consistent, I would probably take the following approach:

import re

# List of fields. Corresponds to columns and rows in input data.
fields = (
  ("Police Response", "Incident Desc", "OFC", "Received"),
  ("Disp", "Location"),
  ("Event Number", "ID", "Priority", "Case No")
)

# generate one pattern per line, based on fields
patterns = [re.compile(":(.*)".join(f) + ":(.*)") for f in fields]

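Here we generate the search patterns from the list of fields. This makes it easy to see and update the expected data format.

With the generated patterns, we can parse the corresponding list of strings into a dictionary keyed by field name.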

def parse_record(lines):
  out = {}
  for f, p, s in zip(fields, patterns, lines):
    # pair each field name on this line with its stripped value
    out.update(zip(f, [v.strip() for v in p.match(s).groups()]))
  return out

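For brevity I have omitted error checking, but adding some checks would let us print friendlier error messages when the input data does not look as expected. In particular, assert that len(lines) == len(fields), and catch the exception raised when p.match(s) returns None.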
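
A rough sketch of what those checks might look like follows. It uses explicit tests and ValueError rather than assert, which is a choice made for this example and not something the answer prescribes; `parse_record_checked` is a hypothetical name, and `fields` and `patterns` are repeated so the block runs on its own.

```python
import re

fields = (
    ("Police Response", "Incident Desc", "OFC", "Received"),
    ("Disp", "Location"),
    ("Event Number", "ID", "Priority", "Case No"),
)
patterns = [re.compile(":(.*)".join(f) + ":(.*)") for f in fields]


def parse_record_checked(lines):
    """Like parse_record, but raise ValueError with a readable message."""
    lines = list(lines)
    if len(lines) != len(fields):
        raise ValueError("expected %d lines per record, got %d"
                         % (len(fields), len(lines)))
    out = {}
    for f, p, s in zip(fields, patterns, lines):
        m = p.match(s)
        if m is None:  # the line does not contain the expected field names
            raise ValueError("line does not match fields %r: %r" % (f, s))
        out.update(zip(f, (v.strip() for v in m.groups())))
    return out
```

A caller can then wrap the call in try/except ValueError and report the offending record instead of failing with an AttributeError on None.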

The last part is to group the input data by the number of lines per record. This can easily be done with the grouper() recipe.
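
grouper() is not a built-in; it is one of the recipes in the itertools documentation. A Python 3 sketch is given here, keeping the (n, iterable) argument order that the call in this answer uses. Newer versions of the documentation list the arguments as (iterable, n), so treat the order below as an assumption matched to this answer.

```python
from itertools import zip_longest  # named izip_longest on Python 2


def grouper(n, iterable, fillvalue=None):
    """Collect data into fixed-length chunks, padding the last one with fillvalue."""
    # n references to the same iterator: zip_longest pulls n items per tuple.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)
```

If the number of lines is not a multiple of n, the last group is padded with fillvalue (None by default), which the parsing function would need to handle.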

Here is an example:

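with open("input_data.txt") as f:
  # grouper(n, iterable): the itertools recipe, with the argument order used in this answer
  for lines in grouper(len(fields), f):
    record = parse_record(lines)
    print(record["ID"], record["Incident Desc"])  # do something with the dict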
