I am trying to parse a multi-line string.

Suppose it is:

text = '''
Section1
stuff belonging to section1
stuff belonging to section1
stuff belonging to section1
Section2
stuff belonging to section2
stuff belonging to section2
stuff belonging to section2
'''

I want to use the finditer method of the re module to get dictionaries such as:

{'section': 'Section1', 'section_data': 'stuff belonging to section1\nstuff belonging to section1\nstuff belonging to section1\n'}
{'section': 'Section2', 'section_data': 'stuff belonging to section2\nstuff belonging to section2\nstuff belonging to section2\n'}

I tried the following:

import re
re_sections=re.compile(r"(?P<section>Section\d)\s*(?P<section_data>.+)", re.DOTALL)
sections_it = re_sections.finditer(text)

for m in sections_it:
    print(m.groupdict())

But this results in:

{'section': 'Section1', 'section_data': 'stuff belonging to section1\nstuff belonging to section1\nstuff belonging to section1\nSection2\nstuff belonging to section2\nstuff belonging to section2\nstuff belonging to section2\n'}

So section_data also matches Section2.

I also tried telling the second group to match everything except the first group, but that produced no output at all:

re_sections=re.compile(r"(?P<section>Section\d)\s+(?P<section_data>^(?P=section))", re.DOTALL)

I know I could use the following regex, but I am looking for a version where I don't have to spell out what the second group looks like:

re_sections=re.compile(r"(?P<section>Section\d)\s+(?P<section_data>[a-z12\s]+)", re.DOTALL)

Thank you very much!

1 Answer

Use a lookahead so that the match runs only up to the next section header or the end of the string:

re_sections=re.compile(r"(?P<section>Section\d)\s*(?P<section_data>.+?)(?=(?:Section\d|$))", re.DOTALL)

Note that this also requires the non-greedy .+?; otherwise it would still match all the way to the end.

Demo:
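
To illustrate why the non-greedy quantifier matters, here is a small Python 3 sketch that runs the same pattern once with .+ and once with .+?. The SAMPLE string and the variable names are made up for illustration; only the pattern itself comes from the answer.

```python
import re

# Hypothetical sample input, shaped like the text in the question.
SAMPLE = "Section1\na\nb\nSection2\nc\nd\n"

# Lookahead for the next section header or the end of the string.
LOOKAHEAD = r"(?=Section\d|$)"

# Greedy: .+ runs to the end of the string, where $ satisfies the lookahead.
greedy = re.compile(
    r"(?P<section>Section\d)\s*(?P<section_data>.+)" + LOOKAHEAD, re.DOTALL
)
# Lazy: .+? stops at the first position where the lookahead succeeds.
lazy = re.compile(
    r"(?P<section>Section\d)\s*(?P<section_data>.+?)" + LOOKAHEAD, re.DOTALL
)

greedy_matches = [m.groupdict() for m in greedy.finditer(SAMPLE)]
lazy_matches = [m.groupdict() for m in lazy.finditer(SAMPLE)]

print(len(greedy_matches))  # 1: a single match that swallows Section2
print(len(lazy_matches))    # 2: one match per section
```

With the greedy version the lookahead is satisfied at the very end of the string, so nothing forces the engine to backtrack to the Section2 header.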

>>> re_sections=re.compile(r"(?P<section>Section\d)\s*(?P<section_data>.+?)(?=(?:Section\d|$))", re.DOTALL)
>>> for m in re_sections.finditer(text): print(m.groupdict())
... 
{'section': 'Section1', 'section_data': 'stuff belonging to section1\nstuff belonging to section1\nstuff belonging to section1\n'}
{'section': 'Section2', 'section_data': 'stuff belonging to section2\nstuff belonging to section2\nstuff belonging to section2'}
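
In the demo output, the last section_data has no trailing newline, because $ also matches just before a newline at the very end of the string. If that newline matters, one possible variation (a sketch, not part of the pattern above) is to replace $ with \Z, which matches only at the very end of the string:

```python
import re

# Same text as in the question.
text = '''
Section1
stuff belonging to section1
stuff belonging to section1
stuff belonging to section1
Section2
stuff belonging to section2
stuff belonging to section2
stuff belonging to section2
'''

# \Z matches only at the very end of the string, whereas $ (without
# re.MULTILINE) also matches just before a final newline.
re_sections = re.compile(
    r"(?P<section>Section\d)\s*(?P<section_data>.+?)(?=Section\d|\Z)",
    re.DOTALL,
)

sections = [m.groupdict() for m in re_sections.finditer(text)]
for section in sections:
    print(section)
```

With this variation both dictionaries should end their section_data with '\n', as in the output the question asks for.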
Answered 2013-04-11T15:55:19.813