python - Python regular expression problem: matching links in td elements

Question

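I am trying to use a regular expression to match cells in a table, but the problem is that not all cells follow the same pattern. For example, a td may look like this: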

<td><a href="page101010.html">PageNumber</a></td>

or like this:

<td align="left" ></td>

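Basically, the hyperlink part is not present in every td; it appears only in some of them.

I tried to handle this case with the Python regex code below, but it fails.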

match = re.search(r'<td align="left" ><?a?.+\>?(.+)\<?\/?a?\>?\<\/td\>', tdlink)

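I only need `match` to find the part enclosed in the () above. However, I get either a syntax error or a None object message.

Where am I going wrong?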

score 6 · Accepted Answer

You are using a regular expression, and matching XML or HTML with such expressions gets too complicated, too fast.

Use an HTML parser instead; Python has several to choose from.

ElementTree example:

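from xml.etree import ElementTree

# ElementTree expects well-formed markup; parse() raises ParseError otherwise.
tree = ElementTree.parse('filename.html')
# iter() walks the whole tree; findall('tr') only sees direct children of the root.
for elem in tree.iter('tr'):
    print(ElementTree.tostring(elem, encoding='unicode'))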
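
To tie this back to the question, here is a minimal sketch that pulls the link target and text out of each td and copes with cells that have no link. The sample markup is an assumption modeled on the two cell formats in the question, wrapped in a hypothetical table so it parses as one well-formed document; real-world HTML that is not well-formed would need a more forgiving parser.

```python
from xml.etree import ElementTree

# Sample markup modeled on the two cell formats from the question
# (the surrounding <table>/<tr> wrapper is assumed for illustration).
html = """<table>
<tr><td><a href="page101010.html">PageNumber</a></td></tr>
<tr><td align="left" ></td></tr>
</table>"""

root = ElementTree.fromstring(html)

results = []
for td in root.iter('td'):
    link = td.find('a')  # None when the cell contains no hyperlink
    if link is not None:
        results.append((link.get('href'), link.text))
    else:
        # No <a> child: fall back to the cell's own text (None if empty).
        results.append((None, td.text))

print(results)
```

Because `td.find('a')` simply returns None for a cell without a link, the optional-hyperlink case needs no special pattern at all, which is the part the regular expression struggled with.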
