
I have a huge number of strings to process in the following manner. For each string, the characters from position 3 through 15 need to be extracted, except position 9.

So, for an input "F01MBBSGB50AGFX0000000000", the output will be "MBBSGB50AGFX".

The obvious way is s[3:11] + s[12:15].
But given the sheer magnitude of data that needs to be processed, I need help on the recommended way to do this.


1 Answer


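When I have something like this, where fixed positions of a string need to be extracted, I like to use Python slices to predefine the fields of interest. This may be a little overkill, but it keeps all the field position and length information in one easily managed data structure, instead of sprinkling [2:10], [12:15], etc. through the code.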

#         1         2
#123456789012345678901234
samples = """\
F01MBBSGB50AGFX0000000000
F01MBCSGB60AGFX0000000000
F01MBDSGB70AGFX0000000000""".splitlines()

# define the different slices you want to get from each line;
# can be arbitrarily many, can extend beyond the length of the
# input lines, can include 'None' to imply 0 as a start or 
# end-of-string as the end
indexes = [(3,9),(10,15)]

# convert to Python slices using 'slice' builtin
slices = [slice(*idx) for idx in indexes]

# make a marker to show slices that will be pulled out
# (assumes slices don't overlap, and no Nones)
marker = ''
off = 0
for idx in sorted(indexes):
    marker += ' '*(idx[0]-off) + '^'*(idx[1]-idx[0])
    off = idx[1]

# extract and concat
for s in samples:
    print(s)
    print(marker)
    print(''.join(s[slc] for slc in slices))
    print()

Prints:

F01MBBSGB50AGFX0000000000
   ^^^^^^ ^^^^^
MBBSGB0AGFX

F01MBCSGB60AGFX0000000000
   ^^^^^^ ^^^^^
MBCSGB0AGFX

F01MBDSGB70AGFX0000000000
   ^^^^^^ ^^^^^
MBDSGB0AGFX
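
Since the question is about a very large number of strings, the slice objects can be built once and reused for every line. The following is only a minimal sketch under that assumption: `extract_fields` and `get_fields` are hypothetical names, and `operator.itemgetter` is swapped in as a standard-library alternative to the generator expression. No timings were measured here, so whether it is actually faster on your data is something to check yourself.

```python
from operator import itemgetter

# same (start, end) pairs as above: 0-based start, exclusive end
indexes = [(3, 9), (10, 15)]

# build the slice objects once, outside the per-line loop
slices = [slice(*idx) for idx in indexes]

# itemgetter called with several slices returns a tuple of substrings
# (with only one slice it would return a bare string, not a tuple)
get_fields = itemgetter(*slices)

def extract_fields(lines):
    """Yield the concatenated fields of each line, one line at a time."""
    for line in lines:
        yield ''.join(get_fields(line))

samples = [
    "F01MBBSGB50AGFX0000000000",
    "F01MBCSGB60AGFX0000000000",
    "F01MBDSGB70AGFX0000000000",
]
results = list(extract_fields(samples))
print(results)
```

Because `extract_fields` is a generator, it can be fed a file object directly and never needs to hold all lines in memory at once.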

If you prefer, you can also define the pieces to extract using (start, length) tuples, as in

fields = [(3,6), (10,5)]

Then convert these to slices:

slices = [slice(start,start+length) for start,length in fields]

All the rest of the code above stays the same.
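
As a quick check that the two notations describe the same fields, here is a small self-contained sketch. The variable names `same` and `extracted` are illustrative only; the field values are the ones used in this answer.

```python
# (start, length) pairs: 0-based start, number of characters to take
fields = [(3, 6), (10, 5)]
slices = [slice(start, start + length) for start, length in fields]

# the equivalent (start, end) pairs used in the first version
indexes = [(3, 9), (10, 15)]

# slice objects compare equal when start, stop and step all match
same = slices == [slice(*idx) for idx in indexes]

s = "F01MBBSGB50AGFX0000000000"
extracted = ''.join(s[slc] for slc in slices)
print(same, extracted)
```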

answered 2013-05-04T07:27:15.090