python - Generating a start_urls list from strings scraped from a page, for further crawling with Scrapy

Question

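Please help,

I have collected a large number of strings corresponding to property IDs from the search results pages of a real estate website. The site uses the property ID to name the page that holds the information about each individual property I want to collect.

How can I feed the list of URLs built by my first spider into the start_urls of another spider?

Thanks. I'm new to this.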

score 1 · Accepted Answer

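There is no need for two spiders. A spider can yield scrapy.http.Request objects with a custom callback, which lets it crawl further pages based on values parsed from the initial set of pages.

Let's look at an example: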

from scrapy import Spider
from scrapy.http import Request

class SearchSpider(Spider):
    ...
    # Request URLs must include the scheme (http:// or https://)
    start_urls = ['http://example.com/list_of_links.html']
    ...

    # Assume this is your "first" spider's parse method
    # It parses your initial search results page and generates a
    # list of URLs somehow.
    def parse(self, response):
        # For example purposes we just take every link
        for href in response.xpath('//a/@href').extract():
            # href is already a string; urljoin makes relative links absolute
            yield Request(response.urljoin(href), callback=self.parse_search_url)

    def parse_search_url(self, response):
        # Here is where you would put what you were thinking of as your
        # "second" spider's parse method. It operates on the results of the
        # URLs scraped in the first parse method.
        pass

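As this example shows, the SearchSpider.parse method parses the search results page (or whatever it is) and yields a request for every URL it finds. So instead of writing those URLs to a file and trying to use them as the start_urls of a second spider, just yield them with the callback set to another method of the same spider (here: parse_search_url).

Hope this helps.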
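One practical detail when yielding a request for every link: hrefs scraped from a page are often relative, and some are not web pages at all (mailto: links, for instance), while a Request needs an absolute http(s) URL. The following is a minimal standard-library sketch of that cleanup step. The function name `absolute_http_urls` and the example URLs are made up for illustration and do not come from the answer above.

```python
from urllib.parse import urljoin, urlparse


def absolute_http_urls(page_url, hrefs):
    """Resolve hrefs against page_url, keep http(s) URLs, drop duplicates."""
    seen = set()
    urls = []
    for href in hrefs:
        # urljoin leaves absolute URLs alone and resolves relative ones
        url = urljoin(page_url, href.strip())
        if urlparse(url).scheme in ("http", "https") and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


if __name__ == "__main__":
    # Hypothetical page URL and hrefs, for illustration only
    page = "http://example.com/search/results.html"
    for url in absolute_http_urls(page, ["detail/1.html", "/detail/2.html"]):
        print(url)
```

Inside a spider's parse method, each URL returned this way could then be passed to Request(url, callback=self.parse_search_url).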

score 1 · Accepted Answer

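As a beginner, I know the yield approach in Scrapy can be hard to grasp. If you can't get the method @audiodude details above to work (it is the better way to scrape, for many reasons), the "workaround" I have used is to generate my URLs in LibreOffice or Excel, using the Concatenate function to add the correct punctuation to each line. Then simply copy and paste them into the spider, e.g.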

start_urls = [
  "http://example.com/link1",
  "http://example.com/link2",
  "http://example.com/link3",
  "http://example.com/link4",
  "http://example.com/link5",
  "http://example.com/link6",
  "http://example.com/link7",
  "http://example.com/link8",
  "http://example.com/link9"
  ]

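Note that each line except the last needs a trailing comma (Python also accepts one after the last item), and each link must be wrapped in straight quotes. Quotes are a pain to work with in Concatenate, so to produce the desired result, enter =Concatenate(CHAR(34),A2,CHAR(34),",") in the cell next to your URL, assuming the URL is in cell A2.

Good luck.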
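If a spreadsheet is not at hand, the same list can be produced in Python itself, which sidesteps the quoting and comma bookkeeping. This is only a sketch under assumptions: the `URL_TEMPLATE` pattern and the helper name `build_start_urls` are hypothetical, and it expects one ID or link name per line.

```python
# Hypothetical URL pattern; adjust it to the real site.
URL_TEMPLATE = "http://example.com/{}"


def build_start_urls(lines, template=URL_TEMPLATE):
    """Return one URL per non-empty line, ignoring surrounding whitespace."""
    return [template.format(line.strip()) for line in lines if line.strip()]


if __name__ == "__main__":
    # Lines as they might be read from a text file, blank line included
    scraped = ["link1\n", "link2\n", "\n", "link3"]
    for url in build_start_urls(scraped):
        print(url)
```

Any iterable of strings works as input, including an open text file, so the result could be assigned to start_urls directly instead of pasting a long literal list.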
