html - Parsing HTML in Python 2.7 with regular expressions - don't quite understand

Question

Sorry if this is a dumb question, but I really need some help with Python.

['<a href="needs to be cut out">Foo to BAR</a>', '<a href="this also needs to be cut out">BAR to Foo</a>']

So I have this list, and I need to cut out the contents of the href attribute and the contents of the <a> tag. Basically, I want a list that looks like this:

[["needs to be cut out", "Foo to BAR"], ["this also needs to be cut out", "BAR to Foo"]]

The href attribute contains a lot of special characters, for example:

<a href="?a=p.stops&amp;direction_id=23600&amp;interval=1&amp;t=wml&amp;l=en">

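I figured that using an HTML parser would be too much trouble when I don't really need to walk an object tree and only need a few URLs and words from a web page. But I really don't understand how to build the regular expression; the ones I've come up with seem completely wrong. So I'm asking whether someone can help me.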
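For reference, a regular expression for this simple case might look like the sketch below. It assumes every href value is double-quoted and that no <a> tag is nested inside another; the names `LINK_RE` and `extract_links` and the link text "Stops" are made up for illustration. It will break on messier markup, which is why the answers lean towards a parser.

```python
import re

try:
    from html import unescape  # Python 3.4+
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7
    unescape = HTMLParser().unescape

# Group 1 is the href value, group 2 is everything between <a ...> and </a>.
# Assumption: href values are always wrapped in double quotes.
LINK_RE = re.compile(r'<a\s[^>]*?href="([^"]*)"[^>]*>(.*?)</a>',
                     re.IGNORECASE | re.DOTALL)


def extract_links(fragments):
    """Return a list of [href, text] pairs found in the given HTML fragments."""
    pairs = []
    for fragment in fragments:
        for href, text in LINK_RE.findall(fragment):
            # Turn entities such as &amp; back into plain characters.
            pairs.append([unescape(href), unescape(text)])
    return pairs


links = [
    '<a href="needs to be cut out">Foo to BAR</a>',
    '<a href="?a=p.stops&amp;direction_id=23600&amp;interval=1&amp;t=wml&amp;l=en">Stops</a>',
]
print(extract_links(links))
```

The lazy `(.*?)` stops at the first closing </a>, so several links in one string are picked up separately.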

score 1 · Accepted Answer

You can use BeautifulSoup to parse the HTML, entities included.

Going by your question, you already have the following list:

l = ['<a href="needs to be cut out">Foo to BAR</a>', '<a href="this also needs to be cut out">BAR to Foo</a>']

Now all you need is the code below.

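# BeautifulSoup 3 import (Python 2 only); with bs4 use: from bs4 import BeautifulSoup
from BeautifulSoup import BeautifulSoup

parsed_list = []

for each in l:
    # Parse each fragment and look up its single <a> tag once.
    link = BeautifulSoup(each).find('a')
    parsed_list.append([link['href'], link.contents[0]])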

Hope this helps :)
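If installing BeautifulSoup is not an option, roughly the same result can be had from the parser that ships with Python. This is only a sketch: `LinkCollector` and `extract_links` are hypothetical names, the link text "Stops" is invented for the example, and the code assumes links are not nested.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class LinkCollector(HTMLParser):
    """Collects [href, text] pairs for every <a href=...> element fed to it."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # Attribute values arrive with entities already decoded.
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append([self._href, ''.join(self._text)])
            self._href = None


def extract_links(fragments):
    """Feed every fragment to one collector and return what it found."""
    collector = LinkCollector()
    for fragment in fragments:
        collector.feed(fragment)
    collector.close()
    return collector.links


print(extract_links([
    '<a href="needs to be cut out">Foo to BAR</a>',
    '<a href="?a=p.stops&amp;direction_id=23600">Stops</a>',
]))
```

One caveat: on Python 2.7 entities inside the link text are reported through separate callbacks rather than `handle_data`, so this sketch only decodes them in the href there.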

score 1 · Accepted Answer

Just use an HTML parser anyway. Python comes with a few included, and the xml.etree.ElementTree API is easier to get working than a regular expression for even simple <a> tags with arbitrary attributes:

from xml.etree import ElementTree as ET

texts = []
for linktext in linkslist:
    link = ET.fromstring(linktext)
    texts.append([link.attrib['href'], link.text])

If you use ' '.join(link.itertext()) you can get the text out of anything nested under the <a> tag, if you find that some of the links have nested <span>, <b>, <i> or other inline tags to mark up the link text further:

for linktext in linkslist:
    link = ET.fromstring(linktext)
    texts.append([link.attrib['href'], ' '.join(link.itertext())])

This gives:

>>> from xml.etree import ElementTree as ET
>>> linkslist = ['<a href="needs to be cut out">Foo to BAR</a>', '<a href="this also needs to be cut out">BAR to Foo</a>']
>>> texts = []
>>> for linktext in linkslist:
...     link = ET.fromstring(linktext)
...     texts.append([link.attrib['href'], ' '.join(link.itertext())])
... 
>>> texts
[['needs to be cut out', 'Foo to BAR'], ['this also needs to be cut out', 'BAR to Foo']]
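The same approach also covers the entity-laden href from the question, with two caveats worth sketching: ElementTree is an XML parser, so a fragment with no closing </a> is rejected, and joining `itertext()` with a space can double up whitespace around nested tags. The helper name `parse_link` and the link text "Stops" are assumptions made for this example.

```python
from xml.etree import ElementTree as ET


def parse_link(fragment):
    """Return [href, text] for one <a> fragment, or None if it is not well-formed XML."""
    try:
        link = ET.fromstring(fragment)
    except ET.ParseError:
        # e.g. an opening <a ...> tag with no matching </a>
        return None
    # ''.join keeps the whitespace already present between nested tags.
    return [link.get('href'), ''.join(link.itertext())]


# &amp; in the attribute is decoded to a plain & by the parser.
print(parse_link('<a href="?a=p.stops&amp;direction_id=23600&amp;interval=1&amp;t=wml&amp;l=en">Stops</a>'))

# Text nested inside inline tags is still collected.
print(parse_link('<a href="x">Foo <b>to</b> BAR</a>'))

# An unclosed tag is not valid XML, so this sketch returns None for it.
print(parse_link('<a href="x">'))
```

For pages that are not well-formed XML, a more forgiving HTML parser is likely the safer choice.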

score 0 · Accepted Answer

I would use Easy Html Parser (EHP) for this.

See https://github.com/iogf/ehp

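from ehp import Html  # assumes EHP is installed as the single-module package "ehp"

lst = ['<a href="needs to be cut out">Foo to BAR</a>', '<a href="this also needs to be cut out">BAR to Foo</a>', '<a href="?a=p.stops&amp;direction_id=23600&amp;interval=1&amp;t=wml&amp;l=en">']

# Walk every parsed fragment and keep (text, href) for tags that carry an href.
data = [(tag.text(), attr.get('href'))
        for indi in lst
        for tag, name, attr in Html().feed(indi).walk()
        if attr.get('href')]

data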

Output:

[('Foo to BAR', 'needs to be cut out'), ('BAR to Foo', 'this also needs to be cut out'), ('', u'?a=p.stops&direction_id=23600&interval=1&t=wml&l=en')]
