python - Get all instances of a regular expression in Python

Question

I am trying to get the innerHTML of all links using the following:

import re

s = '<div><a href="page1.html" title="page1">Go to 1</a>, <a href="page2.html" title="page2">Go to page 2</a><a href="page3.html" title="page3">Go to page 3</a>, <a href="page4.html" title="page4">Go to page 4</a></div>'
match = re.findall(r'<a.*>(.*)</a>', s)

for string in match:
    print(string)

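But I only get the last occurrence, 'Go to page 4'. I think it sees one big string, and the several regex matches are treated as overlapping and ignored. So how do I get a collection of all the matches?

['Go to page 1', 'Go to page 2', 'Go to page 3', 'Go to page 4']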

score 2 · Accepted Answer

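Your immediate problem is that regex quantifiers are greedy by default, i.e. they try to consume the longest string possible. So you are correct that the pattern keeps going until the last </a> it can find. Make the quantifiers non-greedy (.*?):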

match = re.findall(r'<a.*?>(.*?)</a>', s)
                         ^    ^
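
To see the difference in isolation, here is a small sketch on a shortened input. The sample string `html` is made up for illustration and is not the question's data.

```python
import re

# Hypothetical two-link sample, shortened from the question's input.
html = '<a href="p1.html">one</a>, <a href="p2.html">two</a>'

# Greedy: the first .* runs as far as it can while still letting the rest
# of the pattern match, so only the final link's text is captured.
greedy = re.findall(r'<a.*>(.*)</a>', html)

# Non-greedy: each .*? stops at the earliest position that lets the rest match.
lazy = re.findall(r'<a.*?>(.*?)</a>', html)

print(greedy)  # ['two']
print(lazy)    # ['one', 'two']
```

The greedy pattern produces a single match spanning the whole string, which is why only one result comes back rather than several overlapping ones.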

However, this is a terrible way to parse HTML: it is not robust and will break on the smallest change.
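
A few made-up inputs illustrate how easily the non-greedy pattern goes wrong. The sample strings below are assumptions for demonstration only.

```python
import re

pattern = r'<a.*?>(.*?)</a>'

# Nested markup: the captured text still contains the inner tag.
nested = '<a href="page1.html">Go to <b>1</b></a>'
nested_result = re.findall(pattern, nested)
print(nested_result)     # ['Go to <b>1</b>']

# A line break inside the link: '.' does not match '\n' by default,
# so the link is silently missed.
multiline = '<a href="page2.html">Go to\npage 2</a>'
multiline_result = re.findall(pattern, multiline)
print(multiline_result)  # []

# Any tag that starts with "<a" is mistaken for a link, here <abbr>.
abbr = '<abbr title="x">HTML</abbr> <a href="page3.html">Go to page 3</a>'
abbr_result = re.findall(pattern, abbr)
print(abbr_result)       # ['HTML</abbr> <a href="page3.html">Go to page 3']
```

Each case could be patched with a more elaborate pattern, but every patch tends to open up another edge case.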

Here is a much better way:

from bs4 import BeautifulSoup

s = '<div><a href="page1.html" title="page1">Go to 1</a>, <a href="page2.html" title="page2">Go to page 2</a><a href="page3.html" title="page3">Go to page 3</a>, <a href="page4.html" title="page4">Go to page 4</a></div>'
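# Name the parser explicitly; otherwise bs4 picks one and emits a warning.
soup = BeautifulSoup(s, 'html.parser')
# Calling soup('a') is shorthand for soup.find_all('a').
print([el.string for el in soup('a')])
# ['Go to 1', 'Go to page 2', 'Go to page 3', 'Go to page 4']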

You can then use its power to get the href as well as the text, e.g.:

# href=True keeps only the <a> tags that actually have an href attribute.
print([[el.string, el['href']] for el in soup('a', href=True)])
# [['Go to 1', 'page1.html'], ['Go to page 2', 'page2.html'], ['Go to page 3', 'page3.html'], ['Go to page 4', 'page4.html']]

score 2 · Accepted Answer

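I would avoid parsing HTML with regular expressions at all costs. See this article and this SO post for the reasons why. But to sum it up...

Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp

Instead, I would take a look at a Python HTML parsing package like BeautifulSoup or pyquery. They provide nice interfaces to traverse, retrieve, and edit HTML.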
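
If installing a third-party package is not an option, the same idea can be sketched with the standard library's `html.parser` module. This is not one of the packages recommended above, only a dependency-free illustration; the class name `LinkTextParser` and the shortened sample string are made up for this example.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect (text, href) pairs for every <a> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (text, href) pairs
        self._href = None    # href of the <a> that is currently open
        self._chunks = None  # text pieces inside the open <a>, or None

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._chunks = []

    def handle_data(self, data):
        # Only keep text that appears while an <a> is open.
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._chunks is not None:
            self.links.append((''.join(self._chunks), self._href))
            self._chunks = None


s = '<div><a href="page1.html" title="page1">Go to 1</a>, <a href="page2.html" title="page2">Go to page 2</a></div>'
parser = LinkTextParser()
parser.feed(s)
parser.close()
print(parser.links)
# [('Go to 1', 'page1.html'), ('Go to page 2', 'page2.html')]
```

This takes noticeably more code than the dedicated packages need for the same result, which is a fair argument for using one of them when possible.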

score 1 · Accepted Answer

I would suggest using lxml:

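from lxml import etree

# Placeholder markup; etree.fromstring expects well-formed XML.
s = '<div><a href="page1.html">Go to 1</a></div>'
tree = etree.fromstring(s)
for ele in tree.iter('*'):
    # do something with each element, e.g. print its tag and text
    print(ele.tag, ele.text)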

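It provides an iterparse function for processing large files, and it also accepts file-like objects such as the object returned by urllib2.urlopen (urllib.request.urlopen in Python 3). I have been using it to parse HTML and XML.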
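
A minimal sketch of incremental parsing with iterparse, assuming well-formed markup. The `io.BytesIO` object stands in for a large file or a response object, and the fallback import is an assumption added so that the example also runs where lxml is not installed.

```python
import io

try:
    from lxml import etree  # the library this answer recommends
except ImportError:
    # Assumption: fall back to the standard library, whose iterparse
    # accepts the same arguments used below.
    import xml.etree.ElementTree as etree

# Well-formed sample markup wrapped in a file-like object.
source = io.BytesIO(
    b'<div><a href="page1.html">Go to 1</a>, '
    b'<a href="page2.html">Go to page 2</a></div>'
)

links = []
# 'end' events fire once an element has been read completely.
for event, elem in etree.iterparse(source, events=('end',)):
    if elem.tag == 'a':
        links.append((elem.text, elem.get('href')))
        elem.clear()  # drop the element's content to keep memory use low

print(links)
# [('Go to 1', 'page1.html'), ('Go to page 2', 'page2.html')]
```

Because elements are handled as they are read, the whole document never has to be held in memory as a tree at once.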

See: http://lxml.de/tutorial.html#the-element-class
