scrapy - Scrapy - Different page content when downloading response.body

Question

I'm trying to parse a page such as www.page.com/results?sort=price. I parse it with this code:

def start_requests(self):
    start_urls = [
        "www.page.com/results?sort=price",
    ]
    for url in start_urls:
        yield scrapy.Request(url=url, callback=self.parse)

def parse(self, response):

    # some code

    next_page = "www.page.com/results?sort=price&type=12"
    yield response.follow(next_page, self.get_models)

def get_models(self, response):
    # Dump the final URL and the decoded body for inspection.
    with open('/tmp/test/file1.txt', 'w') as f:
        f.write(response.url)
        f.write(response.body.decode('utf-8'))

The output file differs from the one produced by this code:

def start_requests(self):
    start_urls = [
        "www.page.com/results?sort=price&type=12",
    ]
    for url in start_urls:
        yield scrapy.Request(url=url, callback=self.get_models)

def get_models(self, response):
    # Dump the final URL and the decoded body for inspection.
    with open('/tmp/test/file2.txt', 'w') as f:
        f.write(response.url)
        f.write(response.body.decode('utf-8'))

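When I download the page with scrapy shell 'www.page.com/results?sort=price&type=12', the output is similar to file2.txt. The problem is that file1.txt lacks the tags containing the data I need to scrape. What is the difference between these two ways of crawling the page, and why are the downloaded files different?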

score 0 · Accepted Answer

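I think in the second case you are visiting the wrong URL. Check your logs to make sure. I'm not sure how response.follow works, and I don't see any reason to use it here, since you are using a full URL (not just a path). Try changing it to a plain Request: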

def parse(self, response):

    # some code

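    # Note: scrapy.Request needs an absolute URL that includes the scheme
    # (http:// or https://); the URL is kept here as written in the question.
    next_page = "www.page.com/results?sort=price&type=12"
    yield scrapy.Request(next_page, self.get_models)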
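One way to check the "wrong URL" suspicion without running the spider is sketched below. response.follow resolves its argument against the URL of the current response, much like urllib.parse.urljoin, so a string with no scheme is likely treated as a relative path rather than as a host name. The base URL here is an assumption: the question omits the scheme, so http:// is a guess, and a `<base>` tag in the page could change the result.

```python
from urllib.parse import urljoin

# Assumed URL of the page handled by parse(); the scheme is a guess,
# since the question leaves it out.
base_url = "http://www.page.com/results?sort=price"

# The value passed to response.follow in the question. It has no scheme,
# so it is resolved as a relative path, not as a host name.
next_page = "www.page.com/results?sort=price&type=12"

resolved = urljoin(base_url, next_page)
print(resolved)

# With an explicit scheme, the target URL is left unchanged.
absolute = urljoin(base_url, "http://" + next_page)
print(absolute)
```

Under these assumptions the first line printed is http://www.page.com/www.page.com/results?sort=price&type=12, which is a different page from the intended one and would explain why file1.txt lacks the expected tags. Comparing this against the URL that get_models writes at the top of file1.txt should confirm or rule it out.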
