python - Crawler with BeautifulSoup

Question

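I am trying to build a web crawler for a student research project. It already works, but I would like to know whether the approach I used is the best one. (Probably not :p)

The crawler targets the CNN website, and the only thing I want to extract is the text of the news article.

Here is a sample link: link

Here is my code: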

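import urllib2
from bs4 import BeautifulSoup

def cnn_crawler(link):
    # Fetch the page with a custom User-Agent
    req = urllib2.Request(link, headers={'User-Agent': "Magic Browser"})
    usock = urllib2.urlopen(req)
    # Fall back to UTF-8 when the server does not declare a charset
    encoding = usock.headers.getparam('charset') or 'utf-8'
    page = usock.read().decode(encoding)
    usock.close()

    soup = BeautifulSoup(page, 'html.parser')
    # Container that holds the article body
    div = soup.find('div', attrs={'class': 'cnn_strycntntlft'})
    # Keep the article paragraphs, skipping the footer paragraph
    text = [p for p in div.find_all('p')
            if 'cnn_strycbftrtxt' not in p.get('class', [])]
    final = ""
    for entry in text:
        final = final + entry.get_text() + " "
    return final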

score 1 · Accepted Answer

If you only need text extraction, you could try the Goose package:

https://github.com/grangier/python-goose

That is the link. It works very well if all you need is the text.
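A minimal sketch of how Goose could take the place of the manual BeautifulSoup parsing. It assumes the `Goose().extract(url=...)` call and the `cleaned_text` attribute shown in the python-goose README. The function name `extract_article_text` and its `extractor` parameter are hypothetical, added only so the extractor can be swapped out. python-goose targets Python 2, and the `goose3` import for Python 3 is an assumption, so treat this as a starting point rather than a definitive implementation.

```python
def extract_article_text(url, extractor=None):
    """Return the main article text of the page at `url`.

    `extractor` is a hypothetical hook: any object exposing
    extract(url=...) that returns something with a `cleaned_text`
    attribute. When omitted, a Goose instance is created.
    """
    if extractor is None:
        try:
            from goose3 import Goose  # assumed Python 3 fork of python-goose
        except ImportError:
            from goose import Goose   # python-goose (Python 2)
        extractor = Goose()

    # Goose downloads the page and tries to isolate the article body
    article = extractor.extract(url=url)

    # cleaned_text is the article text with markup removed
    return (article.cleaned_text or "").strip()
```

Calling `extract_article_text(link)` with the article URL would then stand in for `cnn_crawler(link)`, with no CNN-specific class names to maintain.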
