python - Scrapy CrawlSpider + Splash: how to follow links through the LinkExtractor?

Question

I have the following code, which partially works:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_splash import SplashRequest


class ThreadSpider(CrawlSpider):
    name = 'thread'
    allowed_domains = ['bbs.example.com']
    start_urls = ['http://bbs.example.com/diy']

    rules = (
        Rule(LinkExtractor(
            allow=(),
            restrict_xpaths=("//a[contains(text(), 'Next Page')]")
        ),
            callback='parse_item',
            process_request='start_requests',
            follow=True),
    )

    def start_requests(self):
        # Render each start URL through Splash
        for url in self.start_urls:
            yield SplashRequest(url, self.parse_item, args={'wait': 0.5})

    def parse_item(self, response):
        # item parser
        pass

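The code only crawls start_urls and does not follow the links specified in restrict_xpaths. If I comment out the start_requests() method and the process_request='start_requests', line in the rule, it runs as expected and follows the links, though of course without JS rendering.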

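I have read the two related questions, "CrawlSpider with Splash getting stuck after first URL" and "CrawlSpider with Splash", and specifically changed scrapy.Request() to SplashRequest() in the start_requests() method, but that does not seem to work. What is wrong with my code? Thanks.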

score 3 · Accepted Answer

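I had a similar problem that seemed specific to integrating Splash with a Scrapy CrawlSpider. It would visit only the start URL and then close. The only way I managed to get it working was to not use the scrapy-splash plugin and instead use the 'process_links' method to prepend the Splash HTTP API URL to every link Scrapy collects. I then made other adjustments to compensate for the new problems this approach creates. Here is what I did: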

You need these two utilities to put the Splash URL together and then take it apart again, if you intend to store the original URL somewhere.

from urllib.parse import urlencode, parse_qs

With the Splash URL prepended to every link, Scrapy would filter them all out as "offsite domain requests", so we set 'localhost' as the allowed domain.

allowed_domains = ['localhost']
start_urls = ['https://www.example.com/']

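However, this creates a problem: we could end up crawling the web endlessly when we only want to crawl one site. Let's fix that with the LinkExtractor rule. By extracting links only from the domain we want, we get around the offsite request problem.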

LinkExtractor(allow=r'(http(s)?://)?(.*\.)?{}.*'.format(r'example.com')),
process_links='process_links',
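As a rough check of what that allow pattern accepts, the sketch below runs it against a few sample URLs with re.search, assuming the pattern is applied with search semantics. The sample URLs and the stricter variant are illustrative assumptions, not part of the answer.

```python
import re

# Pattern from the answer, built the same way
loose = r'(http(s)?://)?(.*\.)?{}.*'.format(r'example.com')

# Hypothetical stricter variant (not from the answer): anchored to the host
# part of the URL, with the dot in the domain escaped
strict = r'^https?://([^/]*\.)?' + re.escape('example.com') + r'(/|$)'

urls = [
    'https://www.example.com/page/2',      # the site we want
    'https://other.org/?ref=example.com',  # other host, domain only in the query
    'https://examplexcom.org/',            # unescaped "." matches any character
]

loose_hits = [bool(re.search(loose, u)) for u in urls]
strict_hits = [bool(re.search(strict, u)) for u in urls]

for u, a, b in zip(urls, loose_hits, strict_hits):
    print(u, 'loose:', a, 'strict:', b)
```

Under these assumptions the original pattern accepts all three URLs, so a tighter pattern may be worth considering if offsite links turn up in the crawl.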

Here is the process_links method. The dictionary passed to urlencode is where you put all of your Splash arguments.

def process_links(self, links):
    for link in links:
        if "http://localhost:8050/render.html?&" not in link.url:
            link.url = "http://localhost:8050/render.html?&" + urlencode({'url':link.url,
                                                                          'wait':2.0})
    return links
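The wrapping logic can be tried outside Scrapy. The following is a minimal sketch in which the Link dataclass is a hypothetical stand-in for Scrapy's link objects (only the .url attribute is modelled) and wrap_links is an illustrative name for the same loop as process_links.

```python
from dataclasses import dataclass
from urllib.parse import urlencode

# Splash endpoint used in the answer, including the trailing "?&"
SPLASH_RENDER = "http://localhost:8050/render.html?&"


@dataclass
class Link:
    """Hypothetical stand-in for a Scrapy link object; only .url is modelled."""
    url: str


def wrap_links(links, wait=2.0):
    """Route each link through Splash, skipping links that are already wrapped."""
    for link in links:
        if SPLASH_RENDER not in link.url:
            link.url = SPLASH_RENDER + urlencode({"url": link.url, "wait": wait})
    return links


links = wrap_links([Link("https://www.example.com/page/2")])
once = links[0].url
twice = wrap_links(links)[0].url  # a second pass leaves the URL unchanged

print(once)
```

The membership check is what keeps an already wrapped link from being wrapped again if it passes through the method a second time.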

Finally, to get the original URL back out of the Splash URL, use the parse_qs method.

parse_qs(response.url)['url'][0]

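One final note on this approach. You will notice that I have an '&' at the end of the base Splash URL (...render.html?&). This makes parsing the Splash URL to extract the actual URL consistent regardless of the order of the parameters when using the urlencode method.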
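The effect of the leading '&' can be seen with the standard library alone. This is a small sketch using a made-up target URL; it builds the Splash URL with the parameters in both orders and then shows what parse_qs returns when the '&' is left out.

```python
from urllib.parse import urlencode, parse_qs, urlparse

base = "http://localhost:8050/render.html?&"
target = "https://www.example.com/page/2"  # illustrative URL

# Same parameters, two different orders
url_first = base + urlencode({"url": target, "wait": 2.0})
url_last = base + urlencode({"wait": 2.0, "url": target})

recovered_first = parse_qs(url_first)["url"][0]
recovered_last = parse_qs(url_last)["url"][0]

# Without the "&", the first parameter's key absorbs the endpoint itself
no_amp = "http://localhost:8050/render.html?" + urlencode({"url": target, "wait": 2.0})
keys_without_amp = list(parse_qs(no_amp).keys())

# Alternative that does not rely on the "&": parse only the query component
recovered_via_urlparse = parse_qs(urlparse(no_amp).query)["url"][0]

print(recovered_first, recovered_last, keys_without_amp)
```

Passing only urlparse(...).query to parse_qs is the more conventional route and works with or without the extra '&'.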

score 2 · Accepted Answer

This seems to be related to https://github.com/scrapy-plugins/scrapy-splash/issues/92

Personally, I use dont_process_response=True so that the response is an HtmlResponse (which is required by the code in _requests_to_follow).

I also redefined the _build_request method in my spider, like so:

def _build_request(self, rule, link):
    r = SplashRequest(url=link.url, callback=self._response_downloaded, args={'wait': 0.5}, dont_process_response=True)
    r.meta.update(rule=rule, link_text=link.text)
    return r

In the GitHub issue, some users simply redefine the _requests_to_follow method in their class.

score 0 · Accepted Answer

Use the code below, just copy and paste:

restrict_xpaths=('//a[contains(text(), "Next Page")]')

instead of

restrict_xpaths=("//a[contains(text(), 'Next Page')]")
