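I am using Python.org version 2.7 64-bit on Windows Vista 64-bit. I have successfully used a recursive web scraper built with Scrapy to parse all the text from Wikipedia articles. However, when I apply the same code to the website referenced in the code below, it does not return the body text: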

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags
import time


class ExampleSpider(CrawlSpider):
    name = "goal3"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com"]
    download_delay = 1

    rules = [Rule(SgmlLinkExtractor(allow=()), follow=True, callback='parse_item')]
    #rules = [Rule(SgmlLinkExtractor(allow=()), 
                  #follow=True),
             #Rule(SgmlLinkExtractor(allow=()), callback='parse_item')
    #]
    #rules = [
        #Rule(
            #SgmlLinkExtractor(allow=('Regions/252/Tournaments/2',)), 
            #callback='parse_item',
            #follow=True,
        #)
    #]
    def parse_item(self,response):
        self.log('A response from %s just arrived!' % response.url)
        scripts = response.selector.xpath("normalize-space(//title)")
        for scripts in scripts:
            body = response.xpath('//p').extract()
            body2 = "".join(body)
            print remove_tags(body2).encode('utf-8')  


execute(['scrapy','crawl','goal3'])

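An example of a page I might want to look at is this one:

http://www.whoscored.com/Articles/pn4gahfw90kjwje-yx7ztq/Show/Player-Focus-Potential-Change-in-System-may-Convince-Vidal-to-Leave-Juventus

As I understand it, the code above should extract every text string found on the page and join them together. The HTML markup of the example page wraps the text in <p> tags, so I am not sure why this is not working. Can anyone see an obvious reason why all I get with this code is the footer?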

1 Answer

Things are a bit messy inside parse_item(). Here is a fixed version that gets the text from all paragraphs (p tags) and joins it:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.utils.markup import remove_tags


class ExampleSpider(CrawlSpider):
    name = "goal3"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com"]
    download_delay = 1

    rules = [Rule(SgmlLinkExtractor(allow=()), follow=True, callback='parse_item')]

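    def parse_item(self, response):
        # Extract every <p> element on the page as an HTML string.
        paragraphs = response.selector.xpath("//p").extract()
        # Strip the markup from each paragraph and join the remaining text.
        text = "".join(remove_tags(paragraph).encode('utf-8') for paragraph in paragraphs)
        print text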

For this page, it prints:

"There is no budget, there is money. We are in a very strong financial position. We can make big signings." Music to the ears of Manchester United fans as vice-chairman Ed Woodward confirmed the club can make big-money acquisitions in this very transfer window. In a bid to return to the summit of England’s top tier, Woodward has effectively given the green light to a spending spree that has supporters rubbing their hands with glee. Ander Herrara and Luke Shaw have arrived for a combined £59m already this summer and the carousel through the Old Trafford entrance door shows no sign of slowing down. Ángel Di María, Mats Hummels and Daley Blind, amongst others, have all been linked with a move to United, while reports suggesting midfield pitbull Arturo Vidal is set to join Louis van Gaal’s side refuse to die down.  "I’m still on holiday at the moment. Can I say I’m staying at Juve? I don’t know. On Monday I’ll talk to (Juventus manager, Massimili
...
 Contact Us | About Us | Glossary | Privacy Policy | WhoScored Ratings
            Copyright © 2014 WhoScored.com
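
The "extract every p element, strip the tags, join the text" step can also be tried without running a crawl. The following is only a rough sketch under stated assumptions: it uses the Python 3 standard library (html.parser) rather than the Python 2.7 Scrapy selector and remove_tags from the spider above, and the names ParagraphText, paragraph_text and the sample markup are made up for illustration. It suggests why footer text can show up in the output: anything wrapped in p tags is collected, not just the article body.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text inside <p> elements, dropping any nested tags.

    Hypothetical helper for illustration; not part of Scrapy.
    """

    def __init__(self):
        super().__init__()
        self.in_paragraph = False  # True while a <p> element is open
        self.paragraphs = []       # one text string per <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Text outside <p> elements (navigation, scripts, ...) is ignored.
        if self.in_paragraph:
            self.paragraphs[-1] += data


def paragraph_text(html):
    """Return the text of all <p> elements in html, joined together."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return "".join(parser.paragraphs)


# Made-up sample page: one article paragraph and one footer paragraph.
sample = (
    "<html><body>"
    "<div>navigation</div>"
    "<p>First <b>bold</b> paragraph. </p>"
    "<p>Footer | Copyright</p>"
    "</body></html>"
)

print(paragraph_text(sample))
```

If only the article body is wanted, the selection would need to be narrowed to the element that contains the article; which element that is on whoscored.com is not shown in this thread.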
answered 2014-07-26T00:28:24.210