
I am trying to get the links from a news website page (from one of its archives). I wrote the following lines of code in Python:

main.py contains:

import mechanize
from bs4 import BeautifulSoup

url = "http://www.thehindu.com/archive/web/2010/06/19/"

br =  mechanize.Browser()
htmltext = br.open(url).read()

articletext = ""
soup = BeautifulSoup(htmltext)
for tag in soup.findAll('li', attrs={"data-section":"Business"}):
    articletext += tag.contents[0]

print articletext

An example of the object in tag.contents[0]: <a href="http://www.thehindu.com/business/itc-to-issue-11-bonus/article472545.ece" target="_blank">ITC to issue 1:1 bonus</a>

But when I run it, I get the following error:

File "C:\Python27\crawler\main.py", line 4, in <module>
    text = articletext.getArticle(url)
  File "C:\Python27\crawler\articletext.py", line 23, in getArticle
    return getArticleText(htmltext)
  File "C:\Python27\crawler\articletext.py", line 18, in getArticleText
    articletext += tag.contents[0]
TypeError: cannot concatenate 'str' and 'Tag' objects

Could someone help me sort this out? I am new to Python programming. Thanks and regards.


3 Answers


You are using link_dictionary ambiguously. If you are not using it for reading back later, try the following code:

import re
import mechanize
from bs4 import BeautifulSoup

url = "http://www.thehindu.com/archive/web/2010/06/19/"  # as in your question

br = mechanize.Browser()
htmltext = br.open(url).read()
soup = BeautifulSoup(htmltext)

articletext = ""
for tag_li in soup.findAll('li', attrs={"data-section": "Op-Ed"}):
    for link in tag_li.findAll('a'):
        # Follow each article link and collect its paragraph text
        urlnew = link.get('href')
        brnew = mechanize.Browser()
        htmltextnew = brnew.open(urlnew).read()
        articletext = ""
        soupnew = BeautifulSoup(htmltextnew)
        for tag in soupnew.findAll('p'):
            articletext += tag.text
        # Collapse runs of whitespace into single spaces before printing
        print re.sub(r'\s+', ' ', articletext, flags=re.M)

Note: re is the regular-expression module used for the whitespace cleanup at the end; it has to be imported, as shown at the top of the snippet.

Answered 2013-11-13T10:55:42.153

I believe you may want to try accessing the text inside the list item, like this:

for tag in soup.findAll('li', attrs={"data-section":"Business"}):
    articletext += tag.string
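
One caveat: in BeautifulSoup, tag.string returns None when a tag has more than one child, which would make the += fail again. A minimal, more defensive sketch using get_text() (my addition, not part of the original answer):

# get_text() concatenates all of the text inside the tag, so it also
# works when the <li> contains nested tags alongside the <a>
for tag in soup.findAll('li', attrs={"data-section": "Business"}):
    articletext += tag.get_text()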

Edited: general comments about getting links from a page

Probably the simplest data type for collecting a bunch of links and retrieving them later is a dictionary.

To get links from a page using BeautifulSoup, you can do something like the following:

from urllib2 import urlopen  # Python 2; urllib2's response object has no context manager

link_dictionary = {}
f = urlopen(url_source)
soup = BeautifulSoup(f)
for link in soup.findAll('a'):
    link_dictionary[link.string] = link.get('href')

This gives you a dictionary named link_dictionary, in which each key is a string that is simply the text content between the <a> </a> tags, and each value is the value of the href attribute.
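
For example, once the dictionary is filled you can retrieve everything again by iterating over it (a minimal usage sketch; the variable names are mine):

# Each key is the anchor text, each value the corresponding href
for text, href in link_dictionary.items():
    print '%s -> %s' % (text, href)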


How to combine this with your previous attempt

Now, if we combine this with the problem you ran into earlier, we can try the following:

link_dictionary = {}
for tag in soup.findAll('li', attrs={"data-section":"Business"}):
    for link in tag.findAll('a'):
        link_dictionary[link.string] = link.get('href') 
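
From there, each stored href could be fed back into mechanize to fetch the article pages, e.g. (a sketch, reusing the br browser object from your question):

# Visit each collected article URL and read its HTML for further parsing
for title, href in link_dictionary.items():
    article_html = br.open(href).read()
    article_soup = BeautifulSoup(article_html)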

If this doesn't make sense, or if you have more questions, you will need to experiment and try to come up with a solution yourself before asking another, clearer question.

Answered 2013-11-11T19:46:43.437

You may want to use the powerful XPath query language together with the faster lxml module. It is as easy as this:

import urllib2
from lxml import etree

url = 'http://www.thehindu.com/archive/web/2010/06/19/'
html = etree.HTML(urllib2.urlopen(url).read())

for link in html.xpath("//li[@data-section='Business']/a"):
    print '{} ({})'.format(link.text, link.attrib['href'])

Update for @data-section='Chennai':

#!/usr/bin/python
import urllib2
from lxml import etree

url = 'http://www.thehindu.com/template/1-0-1/widget/archive/archiveWebDayRest.jsp?d=2010-06-19'
html = etree.HTML(urllib2.urlopen(url).read())

for link in html.xpath("//li[@data-section='Chennai']/a"):
    print '{} => {}'.format(link.text, link.attrib['href'])
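
One caveat (my addition, not part of the original answer): in lxml, link.text is None when an <a> element has no direct text node, so a more defensive version could join all of the text inside the element:

for link in html.xpath("//li[@data-section='Chennai']/a"):
    # itertext() yields every text node inside the element, so the join
    # is safe even when link.text alone would be None
    text = ''.join(link.itertext()).strip()
    print '{} => {}'.format(text, link.attrib['href'])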
Answered 2013-11-11T22:10:53.883