python - Getting TypeError: expected string or buffer in Python

Question

I have this simple code:

#!/usr/bin/python

from bs4 import BeautifulSoup
import requests
import tldextract

def scrape(url):
    main_domain = tldextract.extract(url)
    r = requests.get(url)
    data = r.text
    soup = BeautifulSoup(data)
    list = []
    for href in soup.find_all('a'):
        link_domain = tldextract.extract(href.get('href'))
        print link_domain
        print

I get this error:

Traceback (most recent call last):
  File "cloud.py", line 20, in <module>
    scrape("--- url here -- ")
  File "cloud.py", line 14, in scrape
    link_domain = tldextract.extract(href.get('href'))
  File "/usr/lib/python2.6/site-packages/tldextract/tldextract.py", line 196, in extract
    return TLD_EXTRACTOR(url)
  File "/usr/lib/python2.6/site-packages/tldextract/tldextract.py", line 127, in __call__
    netloc = SCHEME_RE.sub("", url) \
TypeError: expected string or buffer

How can I fix this?

score 0 · Accepted Answer

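Some of your a tags have no href attribute, so .get('href') returns None, and tldextract then passes that None to a regex substitution that only accepts strings.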
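
To see why the call fails, here is a minimal sketch that uses only the standard library and targets Python 3. The SCHEME_RE pattern below is a hypothetical stand-in, not tldextract's actual regex, and strip_scheme and fails_on_none are illustrative names. The point is only that a compiled pattern's sub() rejects None. On Python 3 the wording of the message differs slightly from the Python 2 "expected string or buffer", but the exception type is the same.

```python
import re

# Hypothetical stand-in for the pattern named in the traceback;
# the real tldextract regex may differ.
SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")


def strip_scheme(url):
    """Remove a leading scheme such as 'http://' from url."""
    return SCHEME_RE.sub("", url)


def fails_on_none():
    """Return True if passing None raises TypeError, as in the question."""
    try:
        strip_scheme(None)
    except TypeError:
        return True
    return False
```

An empty string is a valid input to sub(), which is why substituting '' for a missing href avoids the exception.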

Use:

link_domain = tldextract.extract(href.get('href', ''))

to return an empty string in that case, or test for the attribute first:

href = href.get('href')
if not href:
    continue

link_domain = tldextract.extract(href)
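
Putting the second approach into a full loop, a minimal sketch might look like the following. It is written for Python 3 and, so that it runs without third-party packages, it swaps in the standard library's html.parser and urllib.parse for BeautifulSoup and tldextract. HrefCollector and link_hosts are hypothetical names used only for illustration. The guard itself is unchanged: skip any anchor whose href is missing or empty before handing it to the extractor.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class HrefCollector(HTMLParser):
    """Collect the href of every <a> tag; None when the attribute is absent."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # dict.get returns None for a missing key, like href.get('href').
            self.hrefs.append(dict(attrs).get("href"))


def link_hosts(html):
    """Return the host part of each link, skipping anchors without an href."""
    collector = HrefCollector()
    collector.feed(html)
    hosts = []
    for href in collector.hrefs:
        if not href:
            continue  # no href attribute (None) or an empty one
        hosts.append(urlparse(href).netloc)
    return hosts
```

Testing with "not href" rather than "href is None" also skips empty href values, which carry no domain to extract.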
