How about this:
In [1]: s='7. Data 1 1. STR1 STR2 3. 12345 4. 0876 9. NO 2 1. STR 2. STRT STR 3. 9909090 5. YES 6. NO 7. YES 8. NO 9. YES 10. 5000 XX 11. 1000 ZŁ 12. NO PRub. 1 1. 1000 XX 2. NO 3 1. STRT 2. STRT 3. 63110300291 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 ZŁ 12. NO PRub. 1 1. 1000 XX 2. NO 4 1. QWERET 2. IOSTR9 3. 76012509879 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 XX 12. NO PRub. 1 1. 1000 XX 2. NO 0 1.'
In [2]: import re
In [3]: re.findall(r'(?<=\s)\d.*?(?=\s\d\s\d[.](?=$|\s[A-Z]))',s)
Out[3]:
['1 1. STR1 STR2 3. 12345 4. 0876 9. NO',
'2 1. STR 2. STRT STR 3. 9909090 5. YES 6. NO 7. YES 8. NO 9. YES 10. 5000 XX 11. 1000 Z\xc5\x81 12. NO PRub. 1 1. 1000 XX 2. NO',
'3 1. STRT 2. STRT 3. 63110300291 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 Z\xc5\x81 12. NO PRub. 1 1. 1000 XX 2. NO',
'4 1. QWERET 2. IOSTR9 3. 76012509879 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 XX 12. NO PRub. 1 1. 1000 XX 2. NO']
To get your exact output, I would do the following:
In [4]: ns = re.findall(r'(?<=\s)\d.*?(?=\s\d\s\d[.](?=$|\s[A-Z]))',s)
In [5]: [tuple(f.split(' ',1)) for f in ns]
Out[5]:
[('1', '1. STR1 STR2 3. 12345 4. 0876 9. NO'),
('2', '1. STR 2. STRT STR 3. 9909090 5. YES 6. NO 7. YES 8. NO 9. YES 10. 5000 XX 11. 1000 Z\xc5\x81 12. NO PRub. 1 1. 1000 XX 2. NO'),
('3', '1. STRT 2. STRT 3. 63110300291 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 Z\xc5\x81 12. NO PRub. 1 1. 1000 XX 2. NO'),
('4', '1. QWERET 2. IOSTR9 3. 76012509879 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 XX 12. NO PRub. 1 1. 1000 XX 2. NO')]
There may be a better way, but my Python-fu is not as strong as my regex-fu.
Regex explanation:
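One possibly tidier variant, offered as a sketch rather than a definitive answer: wrap both steps in a function and split each match with `str.partition`. The names `RECORD_RE` and `split_records` are made up for illustration, and the short sample string is a shortened stand-in for the real input.

```python
import re

# Same regex as in the session above, written as a raw string.
RECORD_RE = re.compile(r'(?<=\s)\d.*?(?=\s\d\s\d[.](?=$|\s[A-Z]))')

def split_records(text):
    """Return (record_number, record_body) tuples found in text."""
    pairs = []
    for match in RECORD_RE.finditer(text):
        # partition splits on the first space only, like split(' ', 1)
        number, _, body = match.group().partition(' ')
        pairs.append((number, body))
    return pairs

# Hypothetical shortened input in the same layout as the question's data.
print(split_records('7. Data 1 1. AB 2. CD 2 1. EF 2. NO 0 1.'))
```

Note that `\d` matches a single digit, so this sketch assumes record numbers stay below 10, as they do in the sample data.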
(?<=\s) # Use positive look-behind to match a leading space but don't include it
\d # match digit
.*? # Match everything up till the next record (lazy)
# The following positive look-ahead is the key. It matches the start of
# each new record, i.e.
# 2 1. S
# 3 1. S
# 4 1. Q
# 0 1.$
# Look-arounds assert a match but don't consume any characters.
(?=\s\d\s\d[.](?=$|\s[A-Z]))
(?= # positive look-ahead 1
\s # space
\d # digit
\s # space
\d # digit
[.] # period
(?= # positive look-ahead 2
$ # end of string
| # OR
\s[A-Z] # space followed by uppercase letter
) # close look-ahead 2
) # close look-ahead 1
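The annotated breakdown above can also be kept inside the pattern itself by compiling it with `re.VERBOSE`. This is a minimal sketch; the name `RECORD` and the short sample string are illustrative assumptions, not part of the original data.

```python
import re

# The same pattern, spelled out with re.VERBOSE so whitespace is ignored
# and the comments live inside the regex.
RECORD = re.compile(r"""
    (?<=\s)            # preceded by whitespace, which is not consumed
    \d                 # the record number (a single digit)
    .*?                # the record body, as short as possible
    (?=                # look-ahead 1: the next record header follows
        \s\d\s\d[.]    #   space, digit, space, digit, period
        (?=            #   look-ahead 2: what comes after the header
            $          #     end of string
          |            #     OR
            \s[A-Z]    #     space followed by an uppercase letter
        )
    )
""", re.VERBOSE)

# Hypothetical shortened input in the same layout as the question's data.
sample = "7. Data 1 1. AB 2. CD 2 1. EF 2. NO 0 1."
records = RECORD.findall(sample)
print(records)
```

Because both look-aheads only assert and never consume, the header of the next record is still available as the start of the following match.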