python - Web scraping to build a news database

Question

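I am building a web crawler for different news outlets. I am trying to create one for The Hindu newspaper.

I want to fetch the news from the various links listed in its archive. Say I want to fetch the news from the links listed for a given day: http://www.thehindu.com/archive/web/2010/06/19/, i.e. 19 June 2010.

So far I have written the following lines of code: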

import mechanize
from bs4 import BeautifulSoup

url = "http://www.thehindu.com/archive/web/2010/06/19/"

br =  mechanize.Browser()
htmltext = br.open(url).read()

articletext = ""
soup = BeautifulSoup(htmltext)
for tag in soup.findAll('li', attrs={"data-section":"Business"}):
    articletext += tag.contents[0]
print articletext

But I am not getting the desired result. I am basically stuck. Can someone help me sort this out?

score 5 · Accepted Answer

Try the following code:

import re

import mechanize
from bs4 import BeautifulSoup

url = "http://www.thehindu.com/archive/web/2010/06/19/"

br = mechanize.Browser()
htmltext = br.open(url).read()

# Parse the archive page for the chosen day.
soup = BeautifulSoup(htmltext, "html.parser")
for tag_li in soup.findAll('li', attrs={"data-section": "Op-Ed"}):
    for link in tag_li.findAll('a'):
        urlnew = link.get('href')
        # Open each article link and collect the text of its paragraphs.
        htmltextnew = br.open(urlnew).read()
        articletext = ""
        soupnew = BeautifulSoup(htmltextnew, "html.parser")
        for tag in soupnew.findAll('p'):
            articletext += tag.text
        # Collapse runs of whitespace into single spaces before printing.
        print(re.sub(r'\s+', ' ', articletext))

Note that re.sub comes from the re module, so re has to be imported.
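If you want to cover more than one day of the archive, something along these lines might help. It is only a sketch: it assumes every archive day follows the same URL pattern as the one in the question, and the names ARCHIVE_PATTERN, archive_urls and collapse_whitespace are made up for illustration. It uses only the standard library and does no network access.

```python
import re
from datetime import date, timedelta

# Assumption: every archive day uses the URL pattern shown in the question.
ARCHIVE_PATTERN = "http://www.thehindu.com/archive/web/{:%Y/%m/%d}/"


def archive_urls(start, end):
    """Yield one archive URL per day from start to end, inclusive."""
    day = start
    while day <= end:
        yield ARCHIVE_PATTERN.format(day)
        day += timedelta(days=1)


def collapse_whitespace(text):
    """Replace each run of spaces, tabs and newlines with a single space.

    Unlike the re.sub call in the answer, this also strips the ends.
    """
    return re.sub(r"\s+", " ", text).strip()
```

Each URL produced by archive_urls(date(2010, 6, 19), date(2010, 6, 21)) could then be passed to br.open in place of the hard-coded url.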

score 1 · Accepted Answer

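I suggest you take a look at Scrapy. Work through their tutorial and experiment with your own parameters. It has a far more developed web-crawling infrastructure than the mechanize module.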
