python - Why isn't my link extraction working?

Question

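I'm learning Beautiful Soup and trying to extract all the links from the page http://www.popsci.com ... but I'm getting an error.

This code should work, but it fails on every page I've tried. I'm trying to figure out why.

Here is my code: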

from BeautifulSoup import BeautifulSoup
import urllib2

url="http://www.popsci.com/"

page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

sci=soup.findAll('a')

for eachsci in sci:
    print eachsci['href']+","+eachsci.string

...and this is the error I get:

Traceback (most recent call last):
  File "/root/Desktop/3.py", line 12, in <module>
    print eachsci['href']+","+eachsci.string
TypeError: coercing to Unicode: need string or buffer, NoneType found
[Finished in 1.3s with exit code 1]

score 2 · Accepted Answer

When an a element contains no text, eachsci.string is None, and you cannot concatenate None with a string using the + operator, as you are trying to do.

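If you replace eachsci.string with eachsci.text, that error goes away, because eachsci.text holds an empty string ('') when the a element is empty, and concatenating that with another string is no problem.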
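To see the failure in isolation, here is a minimal sketch that needs no parser at all. It is written for Python 3, where the TypeError message is worded differently from the Python 2 traceback in the question. The variable link_text is a hypothetical stand-in for eachsci.string on an empty a element, and the example.com URL is a placeholder.

```python
# link_text stands in for eachsci.string on an <a> element with no text
# (hypothetical name, not part of Beautiful Soup).
link_text = None

failure = None
try:
    # str + None raises TypeError, just like the line in the traceback
    line = "http://example.com/" + "," + link_text
except TypeError as exc:
    failure = str(exc)

# Falling back to an empty string makes the concatenation safe,
# which is the effect of switching from .string to .text
safe_line = "http://example.com/" + "," + (link_text or "")

print(failure)
print(safe_line)
```

The `or ""` fallback is only one way to get an empty string here; the answer's suggestion of using .text achieves the same result without the extra expression.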

However, you will run into another problem when you hit an a element that has no href attribute: when that happens, you get a KeyError.

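You can work around that using dict.get(), which can return a default value when the key is not in the dictionary (the a element pretends to be a dictionary, so this works).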
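The difference between indexing and get() can be sketched with ordinary dictionaries standing in for a tag's attributes. The dictionary names and their contents below are made up for illustration; only dict.get() itself comes from the answer.

```python
# Plain dicts standing in for the attributes of two <a> elements
# (hypothetical sample data).
attrs_with_href = {"href": "/gadgets", "class": "nav"}
attrs_without_href = {"name": "top"}

# Indexing a missing key raises KeyError
missing = False
try:
    attrs_without_href["href"]
except KeyError:
    missing = True

# get() returns the stored value when present, the default otherwise
first = attrs_with_href.get("href", "[no href found]")
second = attrs_without_href.get("href", "[no href found]")

print(first)
print(second)
```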

Putting all of that together, here is a variation of your for loop:

for eachsci in sci:
    print eachsci.get('href', '[no href found]') + "," + eachsci.text
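For readers without Beautiful Soup or Python 2 at hand, the same two defaults can be tried with a rough stand-in built on Python 3's standard html.parser module. This is not the Beautiful Soup API: LinkCollector and SAMPLE_HTML are hypothetical names invented for this sketch, and the sample markup is made up rather than taken from popsci.com.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> that is currently open
        self._chunks = None  # text pieces seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # dict.get supplies the default when there is no href attribute
            self._href = dict(attrs).get("href", "[no href found]")
            self._chunks = []

    def handle_data(self, data):
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._chunks is not None:
            # Joining an empty list gives '', so concatenation never sees None
            self.links.append((self._href, "".join(self._chunks)))
            self._chunks = None


# Made-up markup: a normal link, an anchor without href, a link with no text
SAMPLE_HTML = (
    '<a href="/one">First</a>'
    '<a name="top"></a>'
    '<a href="/img"><img src="x.png"></a>'
)

collector = LinkCollector()
collector.feed(SAMPLE_HTML)
collector.close()

lines = [href + "," + text for href, text in collector.links]
for line in lines:
    print(line)
```

The second and third sample links are the two cases that broke the original loop: one has no href, the other has no text, and neither raises here.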
