8

What is the best way to parse the following multi-line data file with Python?

Police Response: 11/6/2012 1:34:06 AM   Incident Desc: Traffic Stop OFC:    Received: 11/6/2012 1:34:06 AM
Disp: PCHK  Location: CLEAR LAKE RD&GREEN HILL RD
Event Number: LLS121106060941   ID: 60941   Priority: 6 Case No:
Police Response:    Incident Desc: Theft    OFC:    Received: 11/6/2012 1:43:35 AM
Disp: CSR   Location: SCH BLACHLY
Event Number: LLS121106060943   ID: 60943   Priority: 4 Case No:
Police Response: 11/6/2012 1:47:47 AM   Incident Desc: Suspicious Vehicle(s)    OFC:        Received: 11/6/2012 1:47:47 AM
Disp: FI    Location: KIRK RD&CLEAR LAKE RD
Event Number: LLS121106060944   ID: 60944   Priority: 6 Case No:

Records always span 3 lines: the line starting with "Police Response" through the line starting with "Event Number". Some fields are often blank.

4 Answers

10

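This should do the trick. I split your data into a list of cases, each holding the three lines of one record. Then I use a regular expression to split each case on the field names. After that, I put the resulting key-value pairs into a dictionary, so you can easily iterate over the cases and access any field value by name. I print the rows only to show the data structure.

Code: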

from pprint import pprint
from collections import OrderedDict
import re

data = """Police Response: 11/6/2012 1:34:06 AM   Incident Desc: Traffic Stop OFC:    Received: 11/6/2012 1:34:06 AM
Disp: PCHK  Location: CLEAR LAKE RD&GREEN HILL RD
Event Number: LLS121106060941   ID: 60941   Priority: 6 Case No:
Police Response:    Incident Desc: Theft    OFC:    Received: 11/6/2012 1:43:35 AM
Disp: CSR   Location: SCH BLACHLY
Event Number: LLS121106060943   ID: 60943   Priority: 4 Case No:
Police Response: 11/6/2012 1:47:47 AM   Incident Desc: Suspicious Vehicle(s)    OFC:        Received: 11/6/2012 1:47:47 AM
Disp: FI    Location: KIRK RD&CLEAR LAKE RD
Event Number: LLS121106060944   ID: 60944   Priority: 6 Case No: """

lines = data.splitlines()
cases = ['\n'.join(lines[i:i+3]) for i in range(0, len(lines), 3)]
pattern = '(Police Response|Incident Desc|OFC|Received|Disp|Location|Event Number|ID|Priority|Case No):'
rows = []
for case in cases:
    pairs =  re.split(pattern, case)[1:]
    rows.append(OrderedDict((pairs[i*2], pairs[i*2+1]) for i in range(10)))

for i, row in enumerate(rows):
    print('============== {} =============='.format(i))
    # list() keeps the output a plain list of pairs on Python 3 as well
    pprint(list(row.items()))

Output:

============== 0 ==============
[('Police Response', ' 11/6/2012 1:34:06 AM   '),
 ('Incident Desc', ' Traffic Stop '),
 ('OFC', '    '),
 ('Received', ' 11/6/2012 1:34:06 AM\n'),
 ('Disp', ' PCHK  '),
 ('Location', ' CLEAR LAKE RD&GREEN HILL RD\n'),
 ('Event Number', ' LLS121106060941   '),
 ('ID', ' 60941   '),
 ('Priority', ' 6 '),
 ('Case No', '')]
============== 1 ==============
[('Police Response', '    '),
 ('Incident Desc', ' Theft    '),
 ('OFC', '    '),
 ('Received', ' 11/6/2012 1:43:35 AM\n'),
 ('Disp', ' CSR   '),
 ('Location', ' SCH BLACHLY\n'),
 ('Event Number', ' LLS121106060943   '),
 ('ID', ' 60943   '),
 ('Priority', ' 4 '),
 ('Case No', '')]
============== 2 ==============
[('Police Response', ' 11/6/2012 1:47:47 AM   '),
 ('Incident Desc', ' Suspicious Vehicle(s)    '),
 ('OFC', '        '),
 ('Received', ' 11/6/2012 1:47:47 AM\n'),
 ('Disp', ' FI    '),
 ('Location', ' KIRK RD&CLEAR LAKE RD\n'),
 ('Event Number', ' LLS121106060944   '),
 ('ID', ' 60944   '),
 ('Priority', ' 6 '),
 ('Case No', ' ')]
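
As the output shows, the values keep their surrounding spaces and trailing newlines. A minimal sketch of the same split with each value stripped follows; `FIELD_NAMES`, `PATTERN`, and `parse_case` are illustrative names, not part of the answer's code.

```python
import re
from collections import OrderedDict

# Hypothetical helper names; the field list is the one used in the answer's pattern.
FIELD_NAMES = ['Police Response', 'Incident Desc', 'OFC', 'Received', 'Disp',
               'Location', 'Event Number', 'ID', 'Priority', 'Case No']
PATTERN = '(' + '|'.join(FIELD_NAMES) + '):'

def parse_case(case):
    """Split one three-line record on its field names and strip each value."""
    pairs = re.split(PATTERN, case)[1:]
    # pairs alternates name, value, name, value, ...
    return OrderedDict((name, value.strip())
                       for name, value in zip(pairs[0::2], pairs[1::2]))

case = ("Police Response:    Incident Desc: Theft    OFC:    Received: 11/6/2012 1:43:35 AM\n"
        "Disp: CSR   Location: SCH BLACHLY\n"
        "Event Number: LLS121106060943   ID: 60943   Priority: 4 Case No:")
row = parse_case(case)
print(row['Incident Desc'])  # Theft
```

Blank fields come back as empty strings, which makes them easy to test with `if not row['OFC']`.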
Answered on 2012-11-08T18:43:06.040
3

The biggest question:

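What separates the entries? If there are tabs between entries, this is easy: just split each line on tabs. If there are always at least two spaces, you can split on that. If there is sometimes only one space, that complicates things.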
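
A rough sketch of the delimiter-based split, assuming the separators are tabs or runs of two or more spaces (the helper name `split_entries` is made up for illustration). The second call shows the single-space case causing trouble on the sample data as posted:

```python
import re

def split_entries(line):
    """Split a line wherever a tab or a run of two or more spaces occurs."""
    return [part for part in re.split(r'\t+| {2,}', line.strip()) if part]

line = "Disp: PCHK  Location: CLEAR LAKE RD&GREEN HILL RD"
print(split_entries(line))
# ['Disp: PCHK', 'Location: CLEAR LAKE RD&GREEN HILL RD']

# "Priority: 6 Case No:" has only one space between entries, so it is not split.
tight = "Event Number: LLS121106060941   ID: 60941   Priority: 6 Case No:"
print(split_entries(tight))
# ['Event Number: LLS121106060941', 'ID: 60941', 'Priority: 6 Case No:']
```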

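Otherwise, it is easy to write a generator or function that hands back three lines at a time, which you can then pass to a function that parses those three lines. The "3 lines at a time" part of the problem is the easy part.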

def return_3(f):
    # next(f) works on Python 2.6+ and Python 3; f.next() is Python 2 only
    return [next(f) for _ in range(3)]
Answered on 2012-11-08T18:28:07.250
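
One way to turn the function above into a generator over the whole file is sketched below. It uses `itertools.islice` instead of repeated `next()` calls so that the end of the file ends the loop cleanly; `iter_records` is an illustrative name, and `io.StringIO` stands in for a real file.

```python
import io
from itertools import islice

def iter_records(f, size=3):
    """Yield lists of `size` consecutive lines until the file is exhausted."""
    while True:
        chunk = [line.rstrip('\n') for line in islice(f, size)]
        if not chunk:
            return
        yield chunk

sample = io.StringIO(u"a1\na2\na3\nb1\nb2\nb3\n")
for rec in iter_records(sample):
    print(rec)
```

If the file ends partway through a record, the last chunk is yielded short, so a caller may want to check `len(chunk) == size`.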
0

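This regular expression should work:

import re

# Read the whole file so one pattern can span the three lines of a record.
with open('file.dat') as f:
    data = f.read()

# One group per field; '.' does not match newlines, so each group stays on its own line.
records = re.findall(r"""Police Response:(.*)Incident Desc:(.*)OFC:(.*)Received:(.*)
Disp:(.*)Location:(.*)
Event Number:(.*)ID:(.*)Priority:(.*)Case No:(.*)""", data)
Answered on 2012-11-08T19:07:07.077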
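
`re.findall` returns one tuple of captured groups per record. A sketch of pairing those tuples with field names follows, assuming a variant of the pattern that captures `ID` as its own group; `FIELDS`, `RECORD_RE`, and the inline `sample` are illustrative.

```python
import re

FIELDS = ["Police Response", "Incident Desc", "OFC", "Received",
          "Disp", "Location", "Event Number", "ID", "Priority", "Case No"]

# Assumed variant of the pattern with one capture group per field, ID included.
RECORD_RE = re.compile(
    r"Police Response:(.*)Incident Desc:(.*)OFC:(.*)Received:(.*)\n"
    r"Disp:(.*)Location:(.*)\n"
    r"Event Number:(.*)ID:(.*)Priority:(.*)Case No:(.*)")

sample = (
    "Police Response: 11/6/2012 1:34:06 AM   Incident Desc: Traffic Stop OFC:    Received: 11/6/2012 1:34:06 AM\n"
    "Disp: PCHK  Location: CLEAR LAKE RD&GREEN HILL RD\n"
    "Event Number: LLS121106060941   ID: 60941   Priority: 6 Case No:\n"
    "Police Response:    Incident Desc: Theft    OFC:    Received: 11/6/2012 1:43:35 AM\n"
    "Disp: CSR   Location: SCH BLACHLY\n"
    "Event Number: LLS121106060943   ID: 60943   Priority: 4 Case No:\n")

# Pair each captured group with its field name and strip the padding.
records = [dict(zip(FIELDS, (g.strip() for g in groups)))
           for groups in RECORD_RE.findall(sample)]
print(records[1]["Incident Desc"])  # Theft
```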
0

Assuming the input data format is consistent, I would probably take the following approach:

import re

# List of fields. Corresponds to columns and rows in input data.
fields = (
  ("Police Response", "Incident Desc", "OFC", "Received"),
  ("Disp", "Location"),
  ("Event Number", "ID", "Priority", "Case No")
)

# generate pattern based on fields
patterns = [re.compile(":(.*)".join(f) + ":(.*)") for f in fields]

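Here we generate the search patterns from the list of fields. This makes the expected data format easy to see and to update.

Using the generated patterns, we can parse the corresponding list of strings into a dictionary keyed by field name.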

def parse_record(lines):
  out = {}
  for f, p, line in zip(fields, patterns, lines):
     # one (.*) group per field name on this line; strip the padding around each value
     out.update(zip(f, [g.strip() for g in p.match(line).groups()]))
  return out

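I have left out error checking for brevity, but adding some checks would let us print friendlier error messages when the input data is not as expected. In particular, assert that len(lines) == len(fields), and catch the exception raised when p.match(s) returns None.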
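
A minimal sketch of those two checks, raising `ValueError` with a readable message rather than using a bare `assert`; the name `parse_record_checked` and the message wording are illustrative, and `fields` and `patterns` are repeated so the block runs on its own.

```python
import re

fields = (
    ("Police Response", "Incident Desc", "OFC", "Received"),
    ("Disp", "Location"),
    ("Event Number", "ID", "Priority", "Case No"),
)
patterns = [re.compile(":(.*)".join(f) + ":(.*)") for f in fields]

def parse_record_checked(lines):
    """Like parse_record, but raise ValueError with a readable message on bad input."""
    lines = list(lines)
    if len(lines) != len(fields):
        raise ValueError("expected %d lines, got %d" % (len(fields), len(lines)))
    out = {}
    for names, pattern, line in zip(fields, patterns, lines):
        # Treat a missing line (None) as empty so it fails the match below.
        match = pattern.match(line or "")
        if match is None:
            raise ValueError("line does not match fields %r: %r" % (names, line))
        out.update(zip(names, (g.strip() for g in match.groups())))
    return out

good = [
    "Police Response:    Incident Desc: Theft    OFC:    Received: 11/6/2012 1:43:35 AM\n",
    "Disp: CSR   Location: SCH BLACHLY\n",
    "Event Number: LLS121106060943   ID: 60943   Priority: 4 Case No:\n",
]
print(parse_record_checked(good)["Incident Desc"])  # Theft
```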

The last part is to group the input data by the number of lines per record. This can be done easily with the grouper() recipe.

Here is an example:
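
`grouper()` is not a built-in; it comes from the recipes section of the `itertools` documentation. A sketch is given here with the group size as the first argument, to match how this answer calls it. The recipe in the current documentation takes the iterable first, so treat the argument order as an assumption.

```python
from itertools import zip_longest  # named izip_longest in Python 2

def grouper(n, iterable, fillvalue=None):
    """Collect items into fixed-length groups, padding the last one with fillvalue."""
    # The same iterator repeated n times, so each tuple takes n consecutive items.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

print(list(grouper(3, "ABCDEFG", "x")))
# [('A', 'B', 'C'), ('D', 'E', 'F'), ('G', 'x', 'x')]
```

With the default `fillvalue`, an incomplete final record is padded with `None`, which is one of the cases the error checks mentioned above would need to handle.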

for lines in grouper(len(fields), open("input_data.txt")):
  record = parse_record(lines)
  print(record["ID"], record["Incident Desc"])  # do something with the dict
Answered on 2012-11-09T10:06:52.193