python - How to find all matches when there is a prefix

Question

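I am looking for a repeated pattern in an HTML page.
The pattern I am interested in starts after the prefix "<h2>Seasons</h2>". The same pattern also appears before the prefix, and I am not interested in those occurrences.

I tried (and failed) with the following Python code (to keep this question readable, I have simplified the pattern to '<a href=.+?</a>'):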

import re

matches = re.compile('<h2>Seasons</h2>.+?(<a href=.+?</a>)+', re.DOTALL).findall(page)
for ref in matches:
    print(ref)

Given the page:

blah blah html stuff 
<h2>Seasons</h2>  
blah blah  more html stuff
<a href=http://www.111.com>111</a><a href=http://www.222.com>222</a><a href=http://www.333.com>333</a>

the output is:

<a href=http://www.333.com>333</a>

So it prints only the last match; the other two never make it into the findall list. How can I iterate over all matches of the group?

score 2 · Accepted Answer

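The problem is that the regular expression as a whole matches only once. The parenthesized group matches several times within that single match, but a repeated group keeps only its most recent capture. So findall returns one item, and that item is the last link.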
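
To make this concrete, here is a small sketch that runs the pattern from the question against the sample page and counts the overall matches. The variable names are illustrative only.

```python
import re

# Sample page from the question.
page = (
    "blah blah html stuff\n"
    "<h2>Seasons</h2>\n"
    "blah blah  more html stuff\n"
    "<a href=http://www.111.com>111</a>"
    "<a href=http://www.222.com>222</a>"
    "<a href=http://www.333.com>333</a>\n"
)

pattern = re.compile(r'<h2>Seasons</h2>.+?(<a href=.+?</a>)+', re.DOTALL)

# The whole expression matches once, spanning the prefix and all three links.
whole_matches = [m.group(0) for m in pattern.finditer(page)]
print(len(whole_matches))

# The repeated group only remembers its final repetition.
last_capture = pattern.findall(page)
print(last_capture)
```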

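To solve this, you need a regular expression that matches multiple times. You might think of using a lookbehind assertion for the <h2> element, like this: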

(?<=<h2>Seasons</h2>.+?)(<a href=.+?</a>)    # doesn't work

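This says to look for <a> elements, but only if they are preceded by <h2>Seasons</h2>. Unfortunately, a lookbehind pattern must have a fixed length in Python's re module, so you cannot put .+? inside a lookbehind assertion. That approach is out.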
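
As a quick check of that limitation, the following sketch tries to compile the lookbehind version with the standard re module and records the error instead of crashing. The variable name error_message is just for illustration.

```python
import re

try:
    # Variable-width lookbehind: rejected by the standard re module.
    re.compile(r'(?<=<h2>Seasons</h2>.+?)(<a href=.+?</a>)', re.DOTALL)
    error_message = None
except re.error as exc:
    error_message = str(exc)

print(error_message)
```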

The next option is to find the position of the <h2> element first, and then run the regex search from there.

>>> re.findall('<a href=.+?</a>', page[page.find('<h2>Seasons</h2>'):], re.DOTALL)
['<a href=http://www.111.com>111</a>', '<a href=http://www.222.com>222</a>', '<a href=http://www.333.com>333</a>']
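
The same idea can be wrapped in a small helper. This is only a sketch: the function name links_after_prefix is hypothetical, and the sample page adds one link before the prefix (not in the original sample) to show that it is skipped. It passes a start position to the compiled pattern's findall instead of slicing, and it makes the missing-prefix case explicit.

```python
import re

LINK_RE = re.compile(r'<a href=.+?</a>', re.DOTALL)

def links_after_prefix(page, prefix='<h2>Seasons</h2>'):
    """Return every link that appears after the first occurrence of prefix."""
    start = page.find(prefix)
    if start == -1:
        # Prefix not present: nothing qualifies.
        return []
    # Begin scanning right after the prefix; no copy of the page is made.
    return LINK_RE.findall(page, start + len(prefix))

# Assumed sample: the question's page plus one link before the prefix.
page = (
    "blah <a href=http://www.000.com>000</a> html stuff\n"
    "<h2>Seasons</h2>\n"
    "blah blah  more html stuff\n"
    "<a href=http://www.111.com>111</a>"
    "<a href=http://www.222.com>222</a>"
    "<a href=http://www.333.com>333</a>\n"
)

for ref in links_after_prefix(page):
    print(ref)
```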

score 1 · Accepted Answer

You should use an HTML parser like BeautifulSoup; it will make your life a lot easier.
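
BeautifulSoup is a third-party package, so as a rough illustration of the parser-based approach here is a sketch using only the standard library's html.parser instead. The class name SeasonLinks and the sample page (with one extra link before the heading) are assumptions made for the example; a BeautifulSoup version would likely be shorter.

```python
from html.parser import HTMLParser

class SeasonLinks(HTMLParser):
    """Collect href values of <a> tags that appear after <h2>Seasons</h2>."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False        # currently inside an <h2> element
        self.seen_prefix = False  # the "Seasons" heading has been seen
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
        elif tag == "a" and self.seen_prefix:
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2 and data.strip() == "Seasons":
            self.seen_prefix = True

# Assumed sample: the question's page plus one link before the heading.
page = (
    "blah <a href=http://www.000.com>000</a> html stuff\n"
    "<h2>Seasons</h2>\n"
    "blah blah  more html stuff\n"
    "<a href=http://www.111.com>111</a>"
    "<a href=http://www.222.com>222</a>"
    "<a href=http://www.333.com>333</a>\n"
)

parser = SeasonLinks()
parser.feed(page)
parser.close()
print(parser.links)
```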

answered 2012-12-19T23:16:06.513
