python - Beautiful Soup bug?

Question

I have the following code:

for table in soup.findAll("table","tableData"):
    for row in table.findAll("tr"):
        data = row.findAll("td")
        url = data[0].a
        print type(url)

I get the following output:

<class 'bs4.element.Tag'>

This means that url is an object of class Tag, so I should be able to read attributes from it. But if I replace print type(url) with print url['href'], I get the following traceback:

Traceback (most recent call last):
File "baseCreator.py", line 57, in <module>
    createStoresTable()
File "baseCreator.py", line 46, in createStoresTable
    print url['href']
TypeError: 'NoneType' object has no attribute '__getitem__'

What is wrong? And how can I get the value of the href attribute?

score 2 · Accepted Answer

I do like BeautifulSoup, but I personally prefer lxml.html (for HTML that is not too wacky) because it lets you use XPath.

import lxml.html
page = lxml.html.parse('http://somesite.tld')
print page.xpath('//tr/td/a/@href')

Depending on the structure of the page, though, you may need to use some form of XPath "axes".
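As a rough illustration of what an axis might look like here, the sketch below runs two XPath queries over a small inline document. The sample markup, the tableData rows, and the names HTML, hrefs, and cities are made up for the example (they are not from the question's page), and the code is written for Python 3.

```python
import lxml.html

# Hypothetical sample markup: the second row's first cell has no link.
HTML = """
<html><body>
<table class="tableData">
  <tr><td><a href="/store/1">Store 1</a></td><td>Paris</td></tr>
  <tr><td>No link here</td><td>Berlin</td></tr>
  <tr><td><a href="/store/3">Store 3</a></td><td>Rome</td></tr>
</table>
</body></html>
"""

doc = lxml.html.document_fromstring(HTML)

# href of the link in the first cell of each row; rows without a link
# simply contribute nothing to the result.
hrefs = [str(h) for h in doc.xpath('//table[@class="tableData"]//tr/td[1]/a/@href')]

# An explicit axis: for rows whose first cell holds a link, take the text
# of the cell that immediately follows it.
cities = [str(c) for c in doc.xpath(
    '//table[@class="tableData"]//tr[td[1]/a]/td[1]/following-sibling::td[1]/text()'
)]

print(hrefs)
print(cities)
```

Because XPath only returns nodes that actually match, rows without a link never produce a None that has to be checked for.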

You can also use elementsoup as the parser; details are at http://lxml.de/elementsoup.html
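As for the traceback in the question: a 'NoneType' error on url['href'] suggests that data[0].a was None for at least one row, which is what BeautifulSoup returns when the cell contains no <a> element. If you would rather stay with BeautifulSoup, a minimal sketch that skips such rows could look like the following. The sample markup and the extract_hrefs helper are hypothetical, and the code assumes Python 3 with bs4 installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the second row's first cell has no link.
HTML = """
<table class="tableData">
  <tr><td><a href="/store/1">Store 1</a></td><td>Paris</td></tr>
  <tr><td>No link here</td><td>Berlin</td></tr>
  <tr><td><a href="/store/3">Store 3</a></td><td>Rome</td></tr>
</table>
"""

def extract_hrefs(html):
    """Collect the href of the link in the first cell of every table row."""
    soup = BeautifulSoup(html, "html.parser")
    hrefs = []
    for table in soup.find_all("table", "tableData"):
        for row in table.find_all("tr"):
            cells = row.find_all("td")
            if not cells:
                continue  # e.g. a header row that only has <th> cells
            link = cells[0].a  # None when the cell has no <a> element
            if link is None:
                continue
            href = link.get("href")  # .get() returns None instead of raising
            if href is not None:
                hrefs.append(href)
    return hrefs

print(extract_hrefs(HTML))
```

Printing the offending row instead of silently skipping it is an easy way to find out which part of the real page has no link.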
