python - Python regex optional capture groups or lastindex

Question

I am using Python to search a file, line by line, for sections and subsections.

   *** Section with no sub section
  *** Section with sub section ***
           *** Sub Section ***
  *** Another section

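Sections start with 0-2 spaces followed by three asterisks; subsections have 2+ spaces followed by the asterisks.

I write out the section/subsection names without the "***"; currently I do this with re.sub.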

Section: Section with no sub section
Section: Section with sub section
Sub-Section: Sub Section
Section: Another Section

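Question 1: Is there a Python regex with capture groups that would let me access the section/subsection name as a capture group?

Question 2: How can the regex groups let me identify a section versus a subsection (perhaps based on the number of groups or their content in match.group)?

Example (non-working):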

match=re.compile('(group0 *** )(group1 section title)(group2 ***)')
sectionTitle = match.group(1)
if match.lastindex = 0: sectionType = section with no subs
if match.lastindex = 1: sectionType = section with subs
if match.lastindex = 2: sectionType = sub section

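Previous attempts: I have been able to capture sections or subsections using separate regexes and if statements, but I would like to do it all in one pass. Something like the line below; I am having trouble with the greediness of the second group.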

'(^\*{3}\s)(.*)(\s\*{3}$)'

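I can't seem to get the greedy and optional groups to work together. http://pythex.org/ has been very helpful up to this point.

I also tried capturing just the asterisks with "(\*{3})" and then deciding whether the line is a section or a subsection based on the number of groups found.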

sectionRegex = re.compile(r'(\*{3})')
m = re.search(sectionRegex, line)
if m.lastindex == 0:
    sectionName = re.sub(sectionRegex, '', line)
    # Set a section flag
if m.lastindex == 1:
    sectionName = re.sub(sectionRegex, '', line)
    # Set a sub section flag.

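Thanks. Maybe I am going about this completely wrong. Any help is appreciated.

Latest update: I have been playing with Pythex, the answers, and other research. I am now spending more time on capturing the words: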

^[a-zA-Z]+$

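and counting the number of asterisk matches to determine the "level". I am still looking for a single regex that matches the two to three "groups". It may not exist.

Thanks.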

score 1 · Accepted Answer

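Question 1: Is there a Python regex with capture groups that would let me access the section/subsection name as a capture group?

a single regex that matches the two to three "groups". It may not exist

Yes, it can be done. We can break the conditions down into the following tree:

Start of line + 0 to 2 spaces
Either of 2 alternatives:
1. *** + Any text [group 1]
2. 1+ spaces + *** + Any text [group 2]
*** (optional) + End of line

The tree above can be expressed with the following pattern: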

^[ ]{0,2}(?:[*]{3}(.*?)|[ ]+[*]{3}(.*?))(?:[*]{3})?$

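regex101 demo

Note that Section and Sub-Section are captured by different groups (group 1 and group 2, respectively). Both use the same syntax, .*?, with a lazy quantifier (the extra "?") so that the optional trailing "***" can still match.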
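To see why the lazy quantifier matters, here is a minimal sketch that runs a simplified, single-alternative version of the pattern on one line, once with a greedy group and once with a lazy one. The simplified patterns and the variable names are only for illustration; they are not part of the original pattern.

```python
import re

line = "  *** Section with sub section ***"

# Greedy: (.*) runs to the end of the line, so the optional "***" matches nothing.
greedy = re.match(r'^[ ]{0,2}[*]{3}(.*)(?:[*]{3})?$', line)

# Lazy: (.*?) grows one character at a time, so the optional "***" gets to match.
lazy = re.match(r'^[ ]{0,2}[*]{3}(.*?)(?:[*]{3})?$', line)

print(repr(greedy.group(1)))  # ' Section with sub section ***'
print(repr(lazy.group(1)))    # ' Section with sub section '
```

The lazy capture still carries the surrounding spaces, so call .strip() on it or consume the spaces in the pattern with an optional [ ]? on each side.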

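Question 2: How can the regex groups let me identify a section versus a subsection (perhaps based on the number of groups or their content in match.group)?

The regex above captures sections only in group 1 and subsections only in group 2. To make them easier to identify in code, I will use (?P<named> groups) and retrieve the captures with .groupdict().

Code: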

import re

data = """  *** Section with no sub section
  *** Section with sub section ***
           *** Sub Section ***
  *** Another section"""

pattern = r'^[ ]{0,2}(?:[*]{3}[ ]?(?P<Section>.*?)|[ ]+[*]{3}[ ]?(?P<SubSection>.*?))(?:[ ]?[*]{3})?$'
regex = re.compile(pattern, re.M)

for match in regex.finditer(data):
    print(match.groupdict())

''' OUTPUT:
{'Section': 'Section with no sub section', 'SubSection': None}
{'Section': 'Section with sub section', 'SubSection': None}
{'Section': None, 'SubSection': 'Sub Section'}
{'Section': 'Another section', 'SubSection': None}
'''

ideone demo

Instead of printing the dict, you can refer to each Section / SubSection using one of the following:

match.group("Section")
match.group(1)
match.group("SubSection")
match.group(2)
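Since only one of the two named groups takes part in any given match, match.lastgroup (the name of the last matched group) and match.lastindex (its number) should tell you directly which kind of line you have. Below is a rough sketch of that idea; the classify helper and the labels dict are my own names for illustration and reproduce the output format from the question.

```python
import re

pattern = r'^[ ]{0,2}(?:[*]{3}[ ]?(?P<Section>.*?)|[ ]+[*]{3}[ ]?(?P<SubSection>.*?))(?:[ ]?[*]{3})?$'
regex = re.compile(pattern, re.M)

data = """  *** Section with no sub section
  *** Section with sub section ***
           *** Sub Section ***
  *** Another section"""

# Hypothetical mapping from group name to the label used in the question.
labels = {"Section": "Section", "SubSection": "Sub-Section"}

def classify(text):
    """Return (label, title) pairs, one per matching line."""
    results = []
    for match in regex.finditer(text):
        kind = match.lastgroup  # name of the only group that took part
        results.append((labels[kind], match.group(kind)))
    return results

for label, title in classify(data):
    print(f"{label}: {title}")
```

The same check works with numbers: match.lastindex is 1 for a section and 2 for a subsection.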

score 0 · Accepted Answer

Regex:

(^\s+)(\*{3})([a-zA-Z\s]+)(\*{3})*

Captures 3 or 4 groups, as described below (Python numbers capturing groups from 1; group 0 is the whole match).

Group 1: "(^\s+)" captures the leading whitespace
Group 2: "(\*{3})" captures '***'
Group 3: "([a-zA-Z\s]+)" captures alphabetic characters and spaces
Group 4: "(\*{3})*" captures zero or more occurrences of "***"
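As a hedged sketch of how this regex could be used per line: Python numbers capturing groups from 1 (group 0 is the whole match), so the whitespace is m.group(1), the title is m.group(3), and m.lastindex is 4 only when the closing "***" was found. The describe helper and its return shape are assumptions for illustration. Note that ^\s+ needs at least one leading whitespace character, so a line with no indentation is not matched.

```python
import re

section_re = re.compile(r'(^\s+)(\*{3})([a-zA-Z\s]+)(\*{3})*')

lines = [
    "  *** Section with no sub section",
    "  *** Section with sub section ***",
    "           *** Sub Section ***",
]

def describe(line):
    """Return (indent width, title, has closing '***'), or None if no match."""
    m = section_re.match(line)
    if m is None:
        return None
    indent = len(m.group(1))         # width of the leading whitespace
    title = m.group(3).strip()       # title text without surrounding spaces
    has_closing = m.lastindex == 4   # the last group only takes part if '***' closes the line
    return indent, title, has_closing

for line in lines:
    print(describe(line))
```

The indent width can then be compared against whatever threshold you choose for separating sections from subsections.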

score 0 · Accepted Answer

Assuming you mean that subsections have 3 or more spaces, you can do the following:

import re

data = '''
  *** Section with no sub section
*** Section with sub section ***
           *** Sub Section ***
 *** Another section
'''

pattern = r'(?:(^ {0,2}\*{3}.*\*{3} *$)|(^ {0,2}\*{3}.*)|(^ *\*{3}.*\*{3} *$))'

regex = re.compile(pattern, re.M)
print(regex.findall(data))

This will give you groups like the following:

[('', '  *** Section with no sub section', ''),
 ('*** Section with sub section ***', '', ''),
 ('', '', '           *** Sub Section ***'),
 ('', ' *** Another section', '')]
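Because exactly one of the three groups is non-empty for each line, match.lastindex can stand in for the "which group matched" test the question asks about. The sketch below assumes the group-to-type mapping from the question (a section with a closing "***" has subsections); the kinds dict and the label_lines helper are illustrative names only.

```python
import re

pattern = r'(?:(^ {0,2}\*{3}.*\*{3} *$)|(^ {0,2}\*{3}.*)|(^ *\*{3}.*\*{3} *$))'
regex = re.compile(pattern, re.M)

data = '''
  *** Section with no sub section
*** Section with sub section ***
           *** Sub Section ***
 *** Another section
'''

# Assumed meaning of each group number, following the question's wording.
kinds = {1: "section with subs", 2: "section with no subs", 3: "sub section"}

def label_lines(text):
    """Return (kind, title) pairs using the number of the group that matched."""
    out = []
    for m in regex.finditer(text):
        # Strip the surrounding spaces and asterisks from the matched line.
        title = m.group(m.lastindex).strip(" *")
        out.append((kinds[m.lastindex], title))
    return out

for kind, title in label_lines(data):
    print(kind, "->", title)
```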
