python - Proper way to read a position-based text file

Question

So I have a file containing data in this (standardized) format:

 12455WE READ THIS             TOO796445 125997  554777     
 22455 888AND THIS       TOO796445 125997  55477778 2 1

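Probably cooked up by someone who did too much COBOL.

Each field has a fixed length, and I can read it by slicing the line.

My question is: how can I structure my code so that it is more flexible and does not force me to use hard-coded offsets for the slices? Should I use something like a class of constants?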

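Edit:

Also, the first digit (0 -> 9, always present) determines the structure of the line, which has a fixed length. In addition, the file is provided by a third party who ensures its validity, so I don't need to check the format, only read it. There are about 11 different line structures.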

score 3 · Accepted Answer

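My suggestion is to use a dictionary keyed by the 5-digit line type code. Each value in the dictionary can be a list of field offsets (or of (offset, width) tuples), indexed by field position.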
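A minimal sketch of that idea, using one of the sample lines from the question. The layouts are read off the sample data and are assumptions, not the real file specification; `LAYOUTS` and `split_line` are hypothetical names.

```python
# Hypothetical layouts: record type code -> list of (offset, width) tuples,
# indexed by field position. The numbers are guessed from the sample lines.
LAYOUTS = {
    "12455": [(0, 5), (5, 28), (33, 6), (40, 6), (48, 6)],
    "22455": [(0, 5), (6, 3), (9, 18), (27, 6), (34, 6), (42, 8), (51, 3)],
}


def split_line(line):
    """Return the fields of one line as a list of strings."""
    layout = LAYOUTS[line[:5]]  # the first 5 characters select the layout
    return [line[offset:offset + width] for offset, width in layout]


fields = split_line("12455WE READ THIS             TOO796445 125997  554777")
print(fields[2])  # third field of a '12455' record
```

Keeping the offsets in one table means a layout change only touches `LAYOUTS`, not the slicing code.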

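If your fields have names, it might be convenient to use a class instead of a list to store the field offset data. However, namedtuples might be better here, because then you can access your field offset data either by name or by field position, so you get the best of both worlds.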
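As a small illustration of the two access styles, assuming a made-up two-field layout (`Layout` and its field names are hypothetical):

```python
from collections import namedtuple

# Hypothetical two-field layout, only to show the two access styles.
Layout = namedtuple("Layout", ["record_type", "amount"])
layout = Layout(record_type=(0, 5), amount=(48, 6))

by_name = layout.amount      # access by field name
by_position = layout[1]      # access by field position
print(by_name == by_position)
```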

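namedtuples are actually implemented as classes, but defining a new namedtuple type is much more compact than writing an explicit class definition. namedtuples also use the __slots__ protocol, so they take up less RAM than a normal class that stores its attributes in __dict__.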

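Here is one way to use namedtuples to store the field offset data. I'm not claiming that the following code is the best way to do it, but it should give you some ideas.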

from collections import namedtuple

#Create a namedtuple, `Fields`, containing all field names
fieldnames = [
    'record_type', 
    'special',
    'communication',
    'id_number',
    'transaction_code',
    'amount',
    'other',
]

Fields = namedtuple('Fields', fieldnames)

#Some fake test data
data = [
    #          1         2         3         4         5
    #012345678901234567890123456789012345678901234567890123
    "12455WE READ THIS             TOO796445 125997  554777",
    "22455 888AND THIS       TOO796445 125997  55477778 2 1",
]

#A dict to store the field (offset, width) data for each field in a record,
#keyed by record type, which is always stored at (0, 5)
offsets = {}

#Some fake record structures
offsets['12455'] = Fields(
    record_type=(0, 5), 
    special=None,
    communication=(5, 28),
    id_number=(33, 6),
    transaction_code=(40, 6),
    amount=(48, 6),
    other=None)

offsets['22455'] = Fields( 
    record_type=(0, 5),
    special=(6, 3),
    communication=(9, 18),
    id_number=(27, 6),
    transaction_code=(34, 6),
    amount=(42, 8),
    other=(51,3))

#Test.
for row in data:
    print(row)
    #Get record type
    rt = row[:5]
    #Get field structure
    fields = offsets[rt]
    for name in fieldnames:
        #Get field offset data by field name
        t = getattr(fields, name)
        if t is not None:
            start, flen = t
            stop = start + flen
            #Use a separate name so the `data` list is not rebound mid-loop
            value = row[start:stop]
            print("%-16s ... %r" % (name, value))
    print()

Output

12455WE READ THIS             TOO796445 125997  554777
record_type      ... '12455'
communication    ... 'WE READ THIS             TOO'
id_number        ... '796445'
transaction_code ... '125997'
amount           ... '554777'

22455 888AND THIS       TOO796445 125997  55477778 2 1
record_type      ... '22455'
special          ... '888'
communication    ... 'AND THIS       TOO'
id_number        ... '796445'
transaction_code ... '125997'
amount           ... '55477778'
other            ... '2 1'

score 1 · Accepted Answer

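Create a list of widths and a routine that accepts this list and a column index as parameters. The routine can calculate the start offset of the slice by adding up the widths of all previous columns, and the end offset by adding the width of the indexed column to that.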
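One possible sketch of such a routine. The name `column_slice` and the widths are assumptions made up for illustration, not part of the original file format.

```python
def column_slice(widths, index):
    """Return the slice covering column `index`, given the column widths."""
    start = sum(widths[:index])    # total width of all previous columns
    stop = start + widths[index]   # add the width of the indexed column
    return slice(start, stop)


# Assumed widths for illustration: a 5-character code, a 28-character text
# field and a 6-character number.
widths = [5, 28, 6]
line = "12455" + "WE READ THIS".ljust(28) + "796445"
print(line[column_slice(widths, 2)])
```

Returning a `slice` object keeps the offset arithmetic in one place, and the result can be applied directly to any line with `line[...]`.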

score 1 · Accepted Answer

You can keep a list of column widths describing each format, and unfold a line like this:

formats = [
    [1, ],
    [1, 4, 28, 7, 7, 7],
]

def unfold(line):
    lengths = formats[int(line[0])]
    ends = [sum(lengths[0:n+1]) for n in range(len(lengths))]
    return [line[s:e] for s,e in zip([0] + ends[:-1], ends)]

lines = [
    "12455WE READ THIS             TOO796445 125997 554777",
]

for line in lines:
    print(unfold(line))

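Edit: updated the code to better match what maazza asked for in the edited question. This assumes that the format character is an integer, but it can easily be generalized to other format indicators.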
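One way that generalization might look is to key the formats by the indicator character itself. This is only a sketch: `FORMATS`, `unfold_by_key` and the "A" layout are hypothetical, and `itertools.accumulate` is swapped in for the repeated `sum` calls.

```python
from itertools import accumulate

# Hypothetical formats keyed by the indicator character itself rather than
# by its integer value, so non-numeric indicators such as "A" also work.
FORMATS = {
    "1": [1, 4, 28, 7, 7, 7],
    "A": [1, 3, 3],
}


def unfold_by_key(line):
    """Split a line into columns using the widths selected by its first character."""
    lengths = FORMATS[line[0]]
    ends = list(accumulate(lengths))   # running totals give the end offsets
    starts = [0] + ends[:-1]           # each column starts where the previous ended
    return [line[s:e] for s, e in zip(starts, ends)]


print(unfold_by_key("AXYZ123"))
```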
