
I am a C/C++ programmer with only limited Python experience (mostly plotting and text processing). I am currently working on a personal data-analysis project in which I use Scrapy to crawl all the threads and user information from a forum.

I have put together some initial code that is supposed to log in first and then, starting from a subforum's index page, do the following:

1) Extract all thread links whose href contains "topic"

2) For now, save each page to a file (item information will be extracted once the whole process is up and running)

3) Find the next-page link tagged with class=next, go to the next page, and repeat 1) and 2)

I know that for each thread I will still have to walk through all the pages that hold its reply posts, but I plan to do that once my current code works correctly (a rough sketch of that step follows the full spider code below).

However, my current code only extracts all the threads on the starting URL and then stops. I have been searching for hours without finding a solution, so I am asking here in the hope that someone with Scrapy experience can help. If you need any other information, such as the output, please let me know. Thanks!

Update regarding Paul's answer: I have revised my code. My link extractor has a problem that I still need to fix, but apart from that the rule now works correctly. Thanks again to Paul for the help.

Here is my current spider code:

from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule
from scrapy.selector import Selector

class ZhuaSpider(CrawlSpider):
    name = 'zhuaspider'
    allowed_domains = ['depressionforums.org']
    login_page = 'http://www.domain.com/forums/index.php?app=core&module=global&section=login'
    start_urls = ['http://www.depressionforums.org/forums/forum/12-depression-central/']

    rules = (Rule(SgmlLinkExtractor(restrict_xpaths=('//li[@class="next"]'), unique=True),
                           callback='parse_links',
                           follow=True),
            )

    def start_requests(self):
        """called before crawling starts. Try to login"""
        yield Request(
                url=self.login_page,
                callback=self.login,
                dont_filter=True)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(response,
                formdata={'ips_username': 'myuid', 'ips_password': 'mypwd'},
                callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are successfully logged in."""
        if "Username or password incorrect" in response.body:
            self.log("Login failed.")
        else:
            self.log("Successfully logged in. Let's start crawling!")
            # Now the crawling can begin.
            for url in self.start_urls:
                # explicitly ask Scrapy to run the responses through rules
                yield Request(url, callback=self.parse)

    def parse_links(self, response):
        hxs = Selector(response)
        links = hxs.xpath('//a[contains(@href, "topic")]')
        for link in links:
            title = ''.join(link.xpath('./@title').extract())
            url = ''.join(link.xpath('./@href').extract())
            meta={'title':title,}
            yield Request(url, callback = self.parse_posts, meta=meta,)

    #If I add this line it will only crawl the starting url,
    #otherwise it still won't apply the rule and crawls nothing.
    parse_start_url = parse_links

    def parse_posts(self, response):
        # For now just dump the raw page; assumes the download/ directory already exists.
        filename = 'download/' + response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
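For reference, here is a rough sketch of how I picture that later step of walking through a thread's reply pages. It assumes the reply pages use the same li class="next" pagination link as the forum index and that the hrefs are absolute (as they appear to be in parse_links); I have not verified either against the actual thread markup:

    def parse_posts(self, response):
        """Sketch only: save the page, then follow this thread's own pagination."""
        filename = 'download/' + response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
        # Assumed XPath -- reply pages are presumed to use <li class="next"> like the index.
        hxs = Selector(response)
        next_page = hxs.xpath('//li[@class="next"]/a/@href').extract()
        if next_page:
            # hrefs on this forum appear to be absolute, so no urljoin here.
            yield Request(next_page[0], callback=self.parse_posts, meta=response.meta)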

1 Answer


For CrawlSpider's Rules to work, you need the Requests to be processed by the internal parse() method.

You can do that by explicitly setting callback=self.parse, or by not setting a callback at all.

start_urls = ['http://www.depressionforums.org/forums/forum/12-depression-central/']

rules = (
    Rule(SgmlLinkExtractor(restrict_xpaths=('//li[@class="next"]'), unique=True),
                           callback='parse_links',
                           follow=True),
)

...

def check_login_response(self, response):
    """Check the response returned by a login request to see if we are successfully logged in."""
    if "Username or password incorrect" in response.body:
        self.log("Login failed.")
    else:
        self.log("Successfully logged in. Let's start crawling!")
        # Now the crawling can begin.
        for url in self.start_urls:
            # explicitly ask Scrapy to run the responses through rules
            yield Request(url, callback=self.parse)

Then, with that alone, you should see the pages linked from the //li[@class="next"] elements being crawled and parsed with parse_links()... except for the start_urls themselves.

To run the start_urls through parse_links as well, you have to define the special parse_start_url attribute.

You can do it like this:

def parse_links(self, response):
    hxs = Selector(response)
    links = hxs.xpath('//a[contains(@href, "topic")]')
    for link in links:
        title = ''.join(link.xpath('./@title').extract())
        url = ''.join(link.xpath('./@href').extract())
        meta={'title':title,}
        yield Request(url, callback = self.parse_posts, meta=meta,)

parse_start_url = parse_links
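
As a side note on the link extraction itself: since your thread URLs apparently all contain "topic" (that is what your XPath keys on), you could probably let the CrawlSpider extract those links too with a second Rule instead of the manual loop in parse_links. This is an untested sketch, the allow pattern is only a guess at the forum's URL scheme, and it does not carry the link title in meta the way parse_links does:

rules = (
    # Pagination links keep going through the internal parse(), so the rules stay active.
    Rule(SgmlLinkExtractor(restrict_xpaths=('//li[@class="next"]'), unique=True),
         follow=True),
    # Thread links go straight to parse_posts.
    Rule(SgmlLinkExtractor(allow=(r'topic',), unique=True),
         callback='parse_posts'),
)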
Answered 2014-05-28T08:00:18.130