3

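I'm trying to find the best way to parse a file in Python and create a list of namedtuples, with each tuple representing a single data entity and its attributes. The data looks like this: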

UI: T020  
STY: Acquired Abnormality  
ABR: acab   
STN: A1.2.2.2  
DEF: An abnormal structure, or one that is abnormal in size or location, found   
in or deriving from a previously normal structure.  Acquired abnormalities are  
distinguished from diseases even though they may result in pathological   
functioning (e.g., "hernias incarcerate").   
HL: {isa} Anatomical Abnormality

UI: T145   
RL: exhibits   
ABR: EX   
RIN: exhibited_by   
RTN: R3.3.2   
DEF: Shows or demonstrates.   
HL: {isa} performs   
STL: [Animal|Behavior]; [Group|Behavior]   

UI: etc...

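While a few attributes are shared (e.g. UI), some are not (e.g. STY). However, I can hard-code an exhaustive list of the ones I need.
Since each grouping is separated by a blank line, I used split so that I can process each chunk of data individually: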

chunks = file.read().split("\n\n")  # renamed from `input`, which shadows a builtin
for chunk in chunks:
    process(chunk)

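I've seen some approaches that use string find/splice, itertools.groupby, and even regular expressions. I was thinking of using a regex such as '[A-Z]*:' to find where the headers are, but I'm not sure how to pull out multiple lines before reaching the next header (for example, the multi-line data following DEF in the first example entity).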

I'd appreciate any suggestions.

3 Answers

2
source = """
UI: T020  
STY: Acquired Abnormality  
ABR: acab   
STN: A1.2.2.2  
DEF: An abnormal structure, or one that is abnormal in size or location, found   
in or deriving from a previously normal structure.  Acquired abnormalities are  
distinguished from diseases even though they may result in pathological   
functioning (e.g., "hernias incarcerate").   
HL: {isa} Anatomical Abnormality
"""

import re

inpt = source.split("\n")  # just emulating a file read line by line

reg = re.compile(r"^([A-Z]{2,3}):(.*)$")
output = dict()
current_key = None
current = ""
for line in inpt:
    line_match = reg.match(line)  # check if we hit a "CODE: content" line
    if line_match is not None:
        if current_key is not None:
            output[current_key] = current.strip()  # store the finished field
        current_key = line_match.group(1)
        current = line_match.group(2)
    else:
        # not a header, so it is a continuation of the previous key's value
        current = current.rstrip() + " " + line.strip()

if current_key is not None:
    output[current_key] = current.strip()  # don't forget the last one
print(output)
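
The loop above handles a single entity. As a minimal sketch of how the same idea might be extended to a whole file, assuming entities are separated by blank lines (possibly containing stray whitespace); the names `parse_chunk` and `parse_entities` and the shortened sample text are hypothetical, not part of the original data:

```python
import re

HEADER = re.compile(r"^([A-Z]{2,3}):(.*)$")

def parse_chunk(chunk):
    """Parse one blank-line-delimited block into a dict (hypothetical helper)."""
    fields = {}
    key = None
    for line in chunk.splitlines():
        match = HEADER.match(line)
        if match:
            # header line: start a new field
            key = match.group(1)
            fields[key] = match.group(2).strip()
        elif key is not None and line.strip():
            # continuation line: append to the value of the most recent key
            fields[key] += " " + line.strip()
    return fields

def parse_entities(text):
    """Split the text on blank lines and parse each non-empty block."""
    blocks = re.split(r"\n\s*\n", text)
    return [parse_chunk(block) for block in blocks if block.strip()]

# Shortened, made-up sample in the same layout as the question's data.
sample = (
    "UI: T020\nSTY: Acquired Abnormality\n"
    "DEF: An abnormal structure,\nfound in a normal one.\n"
    "\n"
    "UI: T145\nRL: exhibits\n"
)
entities = parse_entities(sample)
print(entities[0]["DEF"])
```

Each entity becomes its own dict, so attributes that only some entities have (such as STY or RL) simply appear as keys where present.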
answered 2013-04-23T21:23:40.297
2

I'm assuming that if a string spans multiple lines, you want the newlines replaced with spaces (and any extra whitespace removed).

import re

def process_file(filename):
    reg = re.compile(r'([\w]{2,3}):\s') # Matches line header
    tmp = '' # Stored/cached data for multiline string
    key = None # Current key
    data = {}

    with open(filename,'r') as f:
        for row in f:
            row = row.rstrip()
            match = reg.match(row)

            # Matches header or is end, put string to list:
            if (match or not row) and key:
                data[key] = tmp
                key = None
                tmp = ''

            # Empty row, next dataset
            if not row:
                # Prevent empty returns
                if data:
                    yield data
                    data = {}

                continue

            # We do have header
            if match:
                key = str(match.group(1))
                tmp = row[len(match.group(0)):]
                continue

            # No header, just append string -> here goes assumption that you want to
            # remove newlines, trailing spaces and replace them with one single space
            tmp += ' ' + row

    # Missed row?
    if key:
        data[key] = tmp

    # Missed group?
    if data:
        yield data

On each iteration this generator yields a dict with pairs such as UI: T020 (and always at least one item).

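Since it is a generator and reads the file line by line, it should be efficient even on large files: it never loads the whole file into memory at once.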

Here's a small demo:

for data in process_file('data.txt'):
    print('-'*20)
    for i in data:
        print('%s:'%(i), data[i])

    print()

And the actual output:

--------------------
STN: A1.2.2.2
DEF: An abnormal structure, or one that is abnormal in size or location, found in or deriving from a previously normal structure.  Acquired abnormalities are distinguished from diseases even though they may result in pathological functioning (e.g., "hernias incarcerate").
STY: Acquired Abnormality
HL: {isa} Anatomical Abnormality
UI: T020
ABR: acab

--------------------
DEF: Shows or demonstrates.
STL: [Animal|Behavior]; [Group|Behavior]
RL: exhibits
HL: {isa} performs
RTN: R3.3.2
UI: T145
RIN: exhibited_by
ABR: EX
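
For comparison, the blank-line grouping on its own could also be sketched with itertools.groupby, which the question mentions. This is only an illustration under the assumption that a line containing nothing but whitespace separates entities; `iter_chunks` is a hypothetical name, and the sample lines are shortened from the question's data:

```python
from itertools import groupby

def iter_chunks(lines):
    """Yield lists of right-stripped lines, one list per blank-line-separated group."""
    # groupby batches consecutive lines by whether they are blank or not
    for is_blank, group in groupby(lines, key=lambda line: not line.strip()):
        if not is_blank:
            yield [line.rstrip() for line in group]

# Any iterable of lines works here, including an open file object.
sample = [
    "UI: T020  \n",
    "STY: Acquired Abnormality  \n",
    "\n",
    "UI: T145   \n",
    "RL: exhibits   \n",
]
chunks = list(iter_chunks(sample))
print(chunks)
```

Like the generator above, this consumes the input lazily, so only one entity's lines are held in memory at a time.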
answered 2013-04-23T21:30:56.217
0
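import re
from collections import namedtuple

def process(chunk):
    # With a capturing group, re.split returns ['', key1, value1, key2, value2, ...]
    split_chunk = re.split(r'^([A-Z]{2,3}):', chunk, flags=re.MULTILINE)
    d = dict()
    fields = list()
    # Keys sit at the odd indices, each followed by its value
    for i in range(1, len(split_chunk) - 1, 2):
        fields.append(split_chunk[i])
        d[split_chunk[i]] = split_chunk[i + 1].strip()
    # The first key (e.g. 'UI') is used as the type name
    my_tuple = namedtuple(split_chunk[1], fields)
    return my_tuple(**d)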

That should do it. I think I'd just use a dict for this, though. Why are you so attached to a namedtuple?
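
If a namedtuple is still preferred, one option is a single type built from the hard-coded field list the question mentions, rather than a new type per chunk. A minimal sketch, assuming Python 3.7+ (for the `defaults` argument) and assuming the field list below, which is taken only from the two sample entities and may be incomplete; `Entity` and `to_entity` are hypothetical names:

```python
from collections import namedtuple

# Assumed field list, collected from the two sample entities in the question.
FIELDS = ["UI", "STY", "RL", "ABR", "STN", "RIN", "RTN", "DEF", "HL", "STL"]

# One default per field makes every attribute optional (Python 3.7+).
Entity = namedtuple("Entity", FIELDS, defaults=(None,) * len(FIELDS))

def to_entity(d):
    """Build an Entity from a parsed dict; attributes missing from d stay None."""
    return Entity(**d)

e = to_entity({"UI": "T145", "RL": "exhibits", "ABR": "EX"})
print(e.UI, e.RL, e.STY)  # prints: T145 exhibits None
```

With one shared type, every entity exposes the same attributes, and a key outside the field list raises a TypeError instead of silently creating a differently shaped tuple.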

answered 2013-04-23T21:38:47.710