python - Python regular expression to match 0 or 1 repetitions

Question

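I want to match HTML headings (<h1> through <h6>) with a Python regular expression. Some headings have an 'id' attribute, and I want to capture it in a group.

With the following expression, I only get the headings that have an id attribute.

>>> re.findall(r'<h[1-6].*?(id=\".*?\").*?</h[1-6].*?>','<h1>Header1</h1><h2 id="header2">header2</h2>')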
['id="header2"']

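The question mark makes the RE match 0 or 1 repetitions of the preceding RE. But if I put a ? after the closing parenthesis, it returns two empty strings.

>>> re.findall(r'<h[1-6].*?(id=\".*?\")?.*?</h[1-6].*?>','<h1>Header1</h1><h2 id="header2">header2</h2>')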
['', '']

How can I get the following result with a single regular expression?

['', 'id="header2"']

score 5 · Accepted Answer

You are using the wrong tool. Don't use regular expressions to parse HTML; use an HTML parser instead.

The BeautifulSoup library makes your task trivial:

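from bs4 import BeautifulSoup

# htmlsource is the HTML document as a string
soup = BeautifulSoup(htmlsource, 'html.parser')

# find_all accepts a list of tag names
headers = soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])
print([h.attrs.get('id', '') for h in headers])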

Demo:

>>> from bs4 import BeautifulSoup
>>> htmlsource = '<h1>Header1</h1><h2 id="header2">header2</h2>'
>>> soup = BeautifulSoup(htmlsource, 'html.parser')
>>> headers = soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])
>>> [h.attrs.get('id', '') for h in headers]
['', 'header2']
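
If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library's html.parser module. This is only an illustration: the names HeadingIdCollector and heading_ids are made up for this example and are not part of any library.

```python
from html.parser import HTMLParser

HEADING_TAGS = {'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}


class HeadingIdCollector(HTMLParser):
    """Hypothetical helper: records the id of every h1-h6 start tag ('' if absent)."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names;
        # attrs is a list of (name, value) pairs.
        if tag in HEADING_TAGS:
            # "or ''" also covers an id attribute written without a value.
            self.ids.append(dict(attrs).get('id') or '')


def heading_ids(html):
    """Return one entry per heading, in document order."""
    collector = HeadingIdCollector()
    collector.feed(html)
    collector.close()
    return collector.ids


print(heading_ids('<h1>Header1</h1><h2 id="header2">header2</h2>'))
```

As in the BeautifulSoup demo, this yields the id values themselves rather than the full id="..." text.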

score 1 · Accepted Answer

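The '.' does match spaces, but the lazy '.*?' in front of the optional group first tries to match nothing, so the group is attempted right after '<h2', where the next character is a space rather than 'id'. Because the group is optional, it then matches empty. You need to include the leading spaces in the group explicitly. One possibility is: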

>>> re.findall(r'<h[1-6].*?( +id=\".*?\" ?)?.*?</h[1-6].*?>','<h1>Header1</h1><h2 id="header2">header2</h2>')
['', ' id="header2"']
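
If the captured group should not carry the leading space, one possible variant is to match only the opening tag and place the optional id group behind a whitespace character. This is a sketch, not a general HTML solution: it assumes double-quoted attribute values and no '>' inside attribute values, and the names HEADING_ID_RE and heading_id_attrs are made up for this example.

```python
import re

# Hypothetical pattern: match only the opening tag of h1-h6.
# The optional non-capturing group skips earlier attributes lazily, then
# requires whitespace directly before id="..." (so data-id is not picked up).
HEADING_ID_RE = re.compile(r'<h[1-6]\b(?:[^>]*?\s(id="[^"]*"))?[^>]*>')


def heading_id_attrs(html):
    """Return the id="..." text of each heading, or '' when it has none."""
    # With a single capturing group, findall returns that group per match;
    # a group that did not participate in the match is reported as ''.
    return HEADING_ID_RE.findall(html)


print(heading_id_attrs('<h1>Header1</h1><h2 id="header2">header2</h2>'))
# ['', 'id="header2"']
```

Restricting the pattern to the opening tag with [^>]* avoids the lazy '.*?' running across tag boundaries, which is what let the original expression succeed with an empty group.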
