python - TypeError: 'NoneType' object is not callable when used with BeautifulSoup in Python

Question

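Today I was playing around with BeautifulSoup and the Requests API, so I thought I would write a simple scraper that follows links to a depth of 2 (if that makes sense). All the links on the page I am scraping are relative (for example: <a href="/free-man-aman-sethi/books/9788184001341.htm" title="A Free Man">), so to make them absolute I thought I would use urljoin.

To do that, I first have to extract the href value from the <a> tags, and for that I thought I would use split: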

#!/bin/python
#crawl.py
import requests
from bs4 import BeautifulSoup
from urlparse import urljoin

html_source=requests.get("http://www.flipkart.com/books")
soup=BeautifulSoup(html_source.content)
links=soup.find_all("a")
temp=links[0].split('"')

This gives the following error:

Traceback (most recent call last):
  File "test.py", line 10, in <module>
    temp=links[0].split('"')
TypeError: 'NoneType' object is not callable

Having dived in before reading the documentation properly, I realise this may not be the best way to achieve my goal, but why is there a TypeError?

score 5 · Accepted Answer

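links[0] is not a string; it is a bs4.element.Tag. When you look up split on it, it does its magic and tries to find a child element named split, but there is none, so the lookup returns None. You are then calling that None.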

In [10]: l = links[0]

In [11]: type(l)
Out[11]: bs4.element.Tag

In [17]: print l.split
None

In [18]: None()   # :)

TypeError: 'NoneType' object is not callable
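The "magic" can be imitated with a few lines of plain Python. The sketch below is a simplified stand-in, not bs4's actual code: the class name FakeTag and its direct-children-only find are assumptions made for illustration. It only shows how a `__getattr__` fallback turns an unknown attribute name into a child lookup that yields None.

```python
class FakeTag:
    """Simplified stand-in for bs4's Tag (hypothetical, for illustration only)."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def find(self, name):
        # Return the first direct child with a matching name, or None.
        # (The real Tag.find also searches deeper descendants.)
        for child in self.children:
            if child.name == name:
                return child
        return None

    def __getattr__(self, attr):
        # Python calls this only when normal attribute lookup fails,
        # so tag.split ends up here and becomes tag.find("split").
        if attr.startswith("__"):
            raise AttributeError(attr)
        return self.find(attr)


link = FakeTag("a", children=[FakeTag("b")])
print(link.b.name)   # an existing child is found
print(link.split)    # there is no <split> child, so this is None

try:
    link.split('"')  # equivalent to None('"')
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

Real methods such as get are found by normal lookup and never reach the fallback, which is why only unknown names like split come back as None.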

Use indexing to look up HTML attributes:

In [21]: links[0]['href']
Out[21]: '/?ref=1591d2c3-5613-4592-a245-ca34cbd29008&_pop=brdcrumb'

Or use get, if there is a danger that the attribute does not exist:

In [24]: links[0].get('href')
Out[24]: '/?ref=1591d2c3-5613-4592-a245-ca34cbd29008&_pop=brdcrumb'


In [26]: print links[0].get('wharrgarbl')
None

In [27]: print links[0]['wharrgarbl']

KeyError: 'wharrgarbl'
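Combining get with the urljoin step from the question might look like the sketch below. It is written for Python 3, where urljoin lives in urllib.parse, and it parses a small inline string instead of downloading the page. SAMPLE_HTML and the variable names are assumptions made for illustration.

```python
from urllib.parse import urljoin  # Python 3 location of urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://www.flipkart.com/books"

# Hypothetical markup standing in for the downloaded page.
SAMPLE_HTML = """
<a href="/free-man-aman-sethi/books/9788184001341.htm" title="A Free Man">A Free Man</a>
<a name="anchor-without-href">no link here</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

absolute_links = []
for link in soup.find_all("a"):
    href = link.get("href")  # None when the attribute is missing
    if href is not None:
        absolute_links.append(urljoin(BASE_URL, href))

print(absolute_links)
```

Using get here rather than indexing means an <a> tag without an href is skipped instead of raising KeyError.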

score 1 · Accepted Answer

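Because the Tag class uses proxying to access attributes (as Pavel pointed out, this is used to access child elements where possible), None is returned as the default wherever the name is not found.

A contrived example: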

>>> print soup.find_all('a')[0].bob
None
>>> print soup.find_all('a')[0].foobar
None
>>> print soup.find_all('a')[0].split
None

You need to use:

soup.find_all('a')[0].get('href')

where get is a real method:

>>> print soup.find_all('a')[0].get
<bound method Tag.get of <a href="test"></a>>

score 1 · Accepted Answer

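I just ran into the same error, so for what it is worth four years later: if you do need to split a soup element, you can also call str() on it before splitting. In your case that would be: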

    temp = str(links[0]).split('"')
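To see what such a split produces, the sketch below works on plain strings and does not need bs4. The two markup strings are hypothetical, shaped like the example link in the question. It suggests that the position of the href in the result depends on the order of the attributes.

```python
# Hypothetical markup, shaped like the example link in the question.
markup = '<a href="/free-man-aman-sethi/books/9788184001341.htm" title="A Free Man">A Free Man</a>'

parts = markup.split('"')
print(parts[1])  # the href value, but only because href comes first

# With the attributes in a different order, index 1 is no longer the href.
reordered = '<a title="A Free Man" href="/free-man-aman-sethi/books/9788184001341.htm">A Free Man</a>'
reordered_parts = reordered.split('"')
print(reordered_parts[1])
```

This fragility is one reason the get('href') approach from the other answers is usually the safer choice.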
