0

Sorry if this is a dumb question, but I really need some help with Python.

['<a href="needs to be cut out">Foo to BAR</a>', '<a href="this also needs to be cut out">BAR to Foo</a>']

So I have this list, and I need to pull out the content of the href attribute and the text inside the <a> tag. Basically, I want a list that looks like this:

[["needs to be cut out", "Foo to BAR"], ["this also needs to be cut out", "BAR to Foo"]]

The href attribute contains a lot of special characters, for example:

<a href="?a=p.stops&amp;direction_id=23600&amp;interval=1&amp;t=wml&amp;l=en">

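I figure that using an HTML parser would be too much trouble if I don't really need to parse the object tree and only need a few URLs and words from a web page. But I really don't understand how to write regular expressions. The one I came up with seems completely wrong. So I'm asking whether someone can help me.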

4

3 Answers

1

You can use BeautifulSoup to parse the HTML; it also decodes the HTML entities.

According to your question, you already have the following list:

l = ['<a href="needs to be cut out">Foo to BAR</a>', '<a href="this also needs to be cut out">BAR to Foo</a>']

Now all you need is the code below.

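# BeautifulSoup 4 (pip install beautifulsoup4); the old BeautifulSoup 3 import does not work on Python 3
from bs4 import BeautifulSoup

parsed_list = []

for each in l:
    # Parse the fragment with the built-in parser and take its first <a> tag
    link = BeautifulSoup(each, 'html.parser').find('a')
    parsed_list.append([link['href'], link.get_text()])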

Hope this helps :)

answered 2012-12-27T05:14:20.063
1

Just use an HTML parser anyway. Python comes with a few included, and the xml.etree.ElementTree API (strictly an XML parser, so it needs well-formed markup) is easier to get working than a regular expression for even simple <a> tags with arbitrary attributes:

from xml.etree import ElementTree as ET

texts = []
for linktext in linkslist:
    link = ET.fromstring(linktext)
    texts.append([link.attrib['href'], link.text])

If you use ' '.join(link.itertext()) you can get the text out of anything nested under the <a> tag, if you find that some of the links have nested <span>, <b>, <i> or other inline tags to mark up the link text further:

for linktext in linkslist:
    link = ET.fromstring(linktext)
    texts.append([link.attrib['href'], ' '.join(link.itertext())])

This gives:

>>> from xml.etree import ElementTree as ET
>>> linkslist = ['<a href="needs to be cut out">Foo to BAR</a>', '<a href="this also needs to be cut out">BAR to Foo</a>']     
>>> texts = []
>>> for linktext in linkslist:
...     link = ET.fromstring(linktext)
...     texts.append([link.attrib['href'], ' '.join(link.itertext())])
... 
>>> texts
[['needs to be cut out', 'Foo to BAR'], ['this also needs to be cut out', 'BAR to Foo']]
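
The same approach should also cope with the entity-laden href from the question, since the parser decodes &amp; while reading the attribute. Here is a minimal sketch; the link text and the nested <b> are made up for illustration (the sample in the question has no closing tag, and ElementTree would reject it as it stands):

```python
from xml.etree import ElementTree as ET

# Hypothetical sample: the href is taken from the question, while the link
# text, the nested <b> and the closing </a> are added for illustration only.
link_html = ('<a href="?a=p.stops&amp;direction_id=23600&amp;interval=1'
             '&amp;t=wml&amp;l=en">Stops <b>list</b></a>')

link = ET.fromstring(link_html)
href = link.attrib['href']       # the XML parser has already turned &amp; into &
text = ''.join(link.itertext())  # text of the <a> tag and everything nested in it

print([href, text])
# ['?a=p.stops&direction_id=23600&interval=1&t=wml&l=en', 'Stops list']
```

Keep in mind that ET.fromstring raises a ParseError on markup that is not well-formed XML, such as an unclosed <a> tag.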
answered 2012-12-26T19:44:12.677
0

I would use Easy Html Parser (EHP) for this.

See https://github.com/iogf/ehp

from ehp import Html  # assumed import; the original snippet omitted it

lst = ['<a href="needs to be cut out">Foo to BAR</a>', '<a href="this also needs to be cut out">BAR to Foo</a>', '<a href="?a=p.stops&amp;direction_id=23600&amp;interval=1&amp;t=wml&amp;l=en">']

# Walk every tag in each fragment and keep (text, href) for tags that carry an href
data = [(tag.text(), attr.get('href')) for indi in lst
        for tag, name, attr in Html().feed(indi).walk() if attr.get('href')]

data

Output:

[('Foo to BAR', 'needs to be cut out'), ('BAR to Foo', 'this also needs to be cut out'), ('', u'?a=p.stops&direction_id=23600&interval=1&t=wml&l=en')]
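
If you would rather not add a third-party dependency, something similar seems doable with the standard library's html.parser, which also tolerates the unclosed <a> tag in the last item. This is only a rough sketch: LinkCollector and extract_links are names made up here, not part of EHP or of the standard library, and it returns [href, text] pairs in the order the question asked for.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects [href, text] for each <a href=...> tag."""

    def __init__(self):
        super().__init__()  # entities are decoded by default (convert_charrefs=True)
        self.links = []
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self.links.append([href, ''])
                self._in_link = True

    def handle_endtag(self, tag):
        if tag == 'a':
            self._in_link = False

    def handle_data(self, data):
        # Append text only while we are inside an <a> tag that had an href
        if self._in_link:
            self.links[-1][1] += data


def extract_links(fragments):
    """Parse each fragment with a fresh parser so an unclosed tag cannot leak."""
    result = []
    for fragment in fragments:
        collector = LinkCollector()
        collector.feed(fragment)
        collector.close()
        result.extend(collector.links)
    return result


lst = ['<a href="needs to be cut out">Foo to BAR</a>',
       '<a href="this also needs to be cut out">BAR to Foo</a>',
       '<a href="?a=p.stops&amp;direction_id=23600&amp;interval=1&amp;t=wml&amp;l=en">']

print(extract_links(lst))
```

Using a new parser per fragment is a deliberate choice here: the last item never closes its <a>, and a shared parser would otherwise keep treating later text as link text.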
answered 2016-03-20T10:17:31.277