python - Extracting data from anchor tags with a regular expression in Python

Question

Suppose my text string is:

text = '<a href="/status/ALL">ALL</a></td>/n<a href="/status/ASSIGN">ASSIGN</a></td>'

I want to extract ALL and ASSIGN, and I am using this regular expression:

re.findall(r'<a href=.*>(\w+)</a>', text, re.DOTALL)

This returns only ASSIGN.

Can someone help me find the mistake in the regular expression? I'm really new to this topic.

score 2 · Accepted Answer

You are using a regular expression, and matching XML with such expressions gets too complicated, too fast.
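As a small illustration of how quickly this goes wrong, the sketch below runs the pattern from the question next to a non-greedy variant. It assumes Python 3, and the variable names `greedy` and `lazy` are only for illustration. The likely cause of the single result is that `.*` is greedy: with `re.DOTALL` it runs to the end of the string and backtracks only as far as the last anchor.

```python
import re

text = '<a href="/status/ALL">ALL</a></td>/n<a href="/status/ASSIGN">ASSIGN</a></td>'

# Greedy: .* consumes as much as possible, so only the last anchor is captured
greedy = re.findall(r'<a href=.*>(\w+)</a>', text, re.DOTALL)

# Non-greedy: .*? stops at the first '>' that lets the rest of the pattern match
lazy = re.findall(r'<a href=.*?>(\w+)</a>', text, re.DOTALL)

print(greedy)  # ['ASSIGN']
print(lazy)    # ['ALL', 'ASSIGN']
```

The non-greedy version works for this exact string, but it still breaks on nested tags, unusual attribute quoting, or anchor text that is not a single word.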

Please don't make it hard on yourself; use an HTML parser instead. Python has several to choose from:

ElementTree example:

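from xml.etree import ElementTree

# ElementTree expects well-formed markup (for example XHTML)
tree = ElementTree.parse('filename.html')
# iter() walks the whole tree; findall('a') would match only direct children of the root
for elem in tree.iter('a'):
    print(ElementTree.tostring(elem, encoding='unicode'))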
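
The fragment in the question is not well-formed (it has stray `</td>` tags), so ElementTree would probably refuse to parse it as given. A minimal sketch of one alternative, assuming Python 3 and the standard library's `html.parser` module, which tolerates such markup; the class name `AnchorTextParser` is hypothetical and not part of any library:

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):  # hypothetical helper, for illustration only
    """Collect the text found inside <a> elements."""

    def __init__(self):
        super().__init__()
        self.in_anchor = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.in_anchor = True
            self.texts.append('')  # start a new entry for this anchor

    def handle_endtag(self, tag):
        if tag == 'a':
            self.in_anchor = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current entry
        if self.in_anchor:
            self.texts[-1] += data


text = '<a href="/status/ALL">ALL</a></td>/n<a href="/status/ASSIGN">ASSIGN</a></td>'

parser = AnchorTextParser()
parser.feed(text)
parser.close()
print(parser.texts)  # ['ALL', 'ASSIGN']
```

Unlike the regular expression, this does not depend on how the `href` attribute is written or on the anchor text being a single word.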
