python - Scrapy only goes through the first link

Question

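I'm new to Scrapy and to Python in general. I'm trying to build a scraper that extracts links from a page, edits those links, and then visits each of them. I'm using Playwright with Scrapy.

This is what I have so far, but for some reason it only scrapes the first link.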

    def parse(self, response):
        for link in response.css('div.som a::attr(href)'):
            yield response.follow(link.get().replace('docs', 'www').replace('com/', 'com/#'),
                                  cookies={'__utms': '265273107'},
                                  meta=dict(
                                      playwright=True,
                                      playwright_include_page=True,
                                      playwright_page_coroutines=[
                                          PageCoroutine('wait_for_selector', 'span#pple_numbers')]
                                  ),
                                  callback=self.parse_c)

    async def parse_c(self, response):
        yield {
            'text': response.css('div.pple_numb span::text').getall()
        }

score 0 · Accepted Answer

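It would be great if you could add more details about the data you are trying to get. In the meantime, could you add the indicated line to see whether the loop goes through the different links?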

    def parse(self, response):
        for link in response.css('div.som a::attr(href)'):
            print(link)  # could you add this line to check whether it prints all the links?
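
The same check can be tried outside the spider. Below is a minimal sketch, assuming the extracted hrefs are absolute URLs. The sample URLs and the `edit_link` helper are hypothetical; only the `replace` chain comes from the question. It prints each edited link and also counts how many distinct URLs remain once the part after `#` is dropped.

```python
from urllib.parse import urldefrag


def edit_link(href):
    """Apply the same string edits the question applies to each href."""
    return href.replace('docs', 'www').replace('com/', 'com/#')


# Hypothetical hrefs standing in for
# response.css('div.som a::attr(href)').getall()
sample_hrefs = [
    'https://docs.example.com/people/1',
    'https://docs.example.com/people/2',
]

edited_links = [edit_link(href) for href in sample_hrefs]
for url in edited_links:
    print(url)

# Drop everything after '#' and count the distinct URLs that remain.
without_fragment = {urldefrag(url).url for url in edited_links}
print(len(edited_links), 'edited links,',
      len(without_fragment), 'distinct without fragment')
```

If the second number is smaller than the first, the edited links differ only in their fragment. Scrapy's default duplicate filter ignores the fragment when comparing requests, so that may be worth ruling out; whether it applies to the real site is an assumption.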

score 0 · Accepted Answer

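According to the documentation, there are two follow functions:

follow:

Returns a Request instance to follow a link url. It accepts the same arguments as the Request.__init__ method, but url can be not only an absolute URL, but also a relative URL, a Link object, e.g. the result of Link Extractors, ...

follow_all:

A generator that produces Request instances to follow all links in urls. It accepts the same arguments as the Request's __init__ method, except that each urls element does not need to be an absolute URL; it can be any of the following: a relative URL, a Link object, e.g. the result of Link Extractors, ...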

If you use follow_all in your code instead of just follow, it may solve the problem.
