I'm running the Python.org build of Python 2.7 64-bit on Windows Vista 64-bit. I have successfully used a recursive web scraper built with Scrapy to parse all of the text from Wikipedia articles. However, when I try to apply the same code to the site referenced in the code below, it returns no body text at all:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags
import time
class ExampleSpider(CrawlSpider):
    name = "goal3"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com"]
    download_delay = 1

    rules = [Rule(SgmlLinkExtractor(allow=()), follow=True, callback='parse_item')]
    #rules = [Rule(SgmlLinkExtractor(allow=()),
    #              follow=True),
    #         Rule(SgmlLinkExtractor(allow=()), callback='parse_item')
    #         ]
    #rules = [
    #    Rule(
    #        SgmlLinkExtractor(allow=('Regions/252/Tournaments/2',)),
    #        callback='parse_item',
    #        follow=True,
    #    )
    #]
    def parse_item(self, response):
        self.log('A response from %s just arrived!' % response.url)
        scripts = response.selector.xpath("normalize-space(//title)")
        for script in scripts:
            # grab every <p> element, join them and strip the markup
            body = response.xpath('//p').extract()
            body2 = "".join(body)
            print remove_tags(body2).encode('utf-8')


execute(['scrapy', 'crawl', 'goal3'])
An example of a page I want to scrape is this one:
http://www.whoscored.com/Articles/pn4gahfw90kjwje-yx7ztq/Show/Player-Focus-Potential-Change-in-System-may-Convince-Vidal-to-Leave-Juventus
As I understand it, the code above should extract every text string found on the page and concatenate them. The HTML of the example page above wraps its body text in <p> tags, so I'm not sure why this isn't working. Can anyone see an obvious reason why all I get from this code is the footer?
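For reference, here is a quick standalone check, separate from the spider (a minimal sketch: the browser-style User-Agent header, the UTF-8 decode and the Selector(text=...) call are my own assumptions, and the URL is just the example article above). It fetches the page directly and reports whether //p matches anything in the raw HTML the server sends back:

import urllib2
from scrapy.selector import Selector

url = ("http://www.whoscored.com/Articles/pn4gahfw90kjwje-yx7ztq/Show/"
       "Player-Focus-Potential-Change-in-System-may-Convince-Vidal-to-Leave-Juventus")

# send a browser-like User-Agent in case the default one is served a stripped-down page
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib2.urlopen(req).read().decode('utf-8', 'ignore')  # assuming the page is UTF-8

sel = Selector(text=html)
paragraphs = sel.xpath('//p//text()').extract()

print len(paragraphs)                            # 0 would mean no <p> text in the raw HTML
print "".join(paragraphs).encode('utf-8')[:500]  # first 500 bytes of whatever text was found

If that prints 0, the article body presumably isn't in the HTML the server returns to a non-browser client (for example, it may be filled in by JavaScript), which would explain why the spider only ever sees the footer.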