I am trying to parse a page, for example www.page.com/results?sort=price. I parse it with this code:
def start_requests(self):
    start_urls = [
        "www.page.com/results?sort=price",
    ]
    for url in start_urls:
        yield scrapy.Request(url=url, callback=self.parse)

def parse(self, response):
    # some code
    next_page = "www.page.com/results?sort=price&type=12"
    yield response.follow(next_page, self.get_models)

def get_models(self, response):
    f = open('/tmp/test/file1.txt', 'w')
    f.write(response.url)
    f.write(response.body.decode('utf-8'))
    f.close()
The output file differs from the one produced by this code:
def start_requests(self):
    start_urls = [
        "www.page.com/results?sort=price&type=12",
    ]
    for url in start_urls:
        yield scrapy.Request(url=url, callback=self.get_models)

def get_models(self, response):
    f = open('/tmp/test/file2.txt', 'w')
    f.write(response.url)
    f.write(response.body.decode('utf-8'))
    f.close()
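When I download the page with scrapy shell 'www.page.com/results?sort=price&type=12', the output is similar to file2.txt. The problem is that file1.txt does not contain the tags with the data I need to scrape. What is the difference between these two ways of crawling the page, and why are the downloaded files different?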
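One way to see exactly where the two dumps diverge is a line-by-line diff. The following is only a minimal sketch using Python's standard difflib module. It assumes both files were written as UTF-8 text by the code above. The helper name diff_dumps and the max_lines cap are illustrative choices, not part of Scrapy.

```python
import difflib
from pathlib import Path


def diff_dumps(path_a, path_b, max_lines=40):
    """Return up to `max_lines` lines of a unified diff between two text files.

    Hypothetical helper for illustration; the paths are whatever the
    spider callbacks wrote to disk.
    """
    lines_a = Path(path_a).read_text(encoding="utf-8").splitlines()
    lines_b = Path(path_b).read_text(encoding="utf-8").splitlines()
    diff = difflib.unified_diff(
        lines_a,
        lines_b,
        fromfile=str(path_a),
        tofile=str(path_b),
        lineterm="",  # lines were split without newlines, so add none back
    )
    # Cap the output so a large HTML page does not flood the terminal.
    return list(diff)[:max_lines]
```

Calling diff_dumps('/tmp/test/file1.txt', '/tmp/test/file2.txt') and printing each returned line would show the first differing lines. Because each file starts with response.url, that includes any difference in the URL that was actually fetched.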