
I'm trying to scrape https://wegotthiscovered.com/reviews/, which uses Ajax pagination. I've tried everything, but it either returns nothing or returns HTTP status code 400. Can anyone help with this?

import json
import scrapy
from ..items import xyzItem

class MySpider(scrapy.Spider):
    name = 'abc'
    data = {"id":"infinite_scroll_1","order":"","orderby":"","catnames":"reviews","postnotin":"900303,899404,898188,897386,896672,893944,895290,895136,892571,892412,891795,887847","timestampbefore":'1589354802'}
    headers = {"content-type": "application/json"}
    url = 'https://wegotthiscovered.com/wp-admin/admin-ajax.php'

    def start_requests(self):
        yield scrapy.Request(
            url=self.url,
            method='POST',
            body=json.dumps(self.data),
            headers=self.headers,
            meta={'index': 0}
        )

    def parse(self, response):
        items = xyzItem()
        i = 1
        movie_title = response.css('h4').css('::text').getall() 
        # movie_text = response.css('.summary').xpath('text()').getall() 
        movie_id = response.css('h4').css('::attr(href)').getall()   


        li = items['movie_title']
        for i in range(len(li)):
            li_split =  li[i].split(" ")
            #print(movietitle)
            #if 'Review:' in li_split or 'review:' in li_split or 'Review' in li_split or 'review' in li_split:
            outputs = DeccanchronicleItem()
            outputs['page_title'] = li[i]
            # outputs['review_content'] = items['movie_text'][i]
            outputs['review_link'] = items['movie_id'][i]
            yield outputs

        page = response.meta['index'] + 1
        self.data['index'] = page
        yield scrapy.Request(self.url, headers=self.headers, method='POST', body=json.dumps(self.data), meta={'index': page})

1 Answer


The main problem with your code is that you are not sending the correct request:

class MySpider(scrapy.Spider):
    name = 'wegotthiscovered'
    data = {
        "id":"infinite_scroll_1",
        "order":"",
        "orderby":"",
        "catnames":"reviews",
        "postnotin":"900303,899404,898188,897386,896672,893944,895290,895136,892571,892412,891795,887847",
        "timestampbefore":'1589363845'
    }
    headers = {
        "content-type": "application/x-www-form-urlencoded; charset=UTF-8",
        "x-requested-with": "XMLHttpRequest",
        'referer': "https://wegotthiscovered.com/reviews/",
    }
    url = 'https://wegotthiscovered.com/wp-admin/admin-ajax.php'
    start_urls = ['https://wegotthiscovered.com/reviews/'] # I used this to get cookies BEFORE POST request

    def parse(self, response):
        yield scrapy.FormRequest(
            url=self.url,
            method='POST',
            callback=self.parse_search,
            formdata={
                'page': '2',
                'action': 'face3_infinite_scroll',
                'attrs': json.dumps(self.data),
            },
            headers=self.headers,
            meta={'index': 0}
        )

    def parse_search(self, response):
        # the titles and links live on the <a> tags inside each <h4>
        movie_title = response.css('h4 ::text').getall()
        movie_id = response.css('h4 a::attr(href)').getall()

        for title, link in zip(movie_title, movie_id):
            outputs = xyzItem()
            outputs['page_title'] = title
            outputs['review_link'] = link
            yield outputs

        # request the next page with the same form-encoded POST (the first AJAX call fetched page 2)
        index = response.meta['index'] + 1
        formdata = {'page': str(index + 2), 'action': 'face3_infinite_scroll', 'attrs': json.dumps(self.data)}
        yield scrapy.FormRequest(self.url, formdata=formdata, headers=self.headers, callback=self.parse_search, meta={'index': index})
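For reference, the body that FormRequest sends is URL-encoded, not JSON — which is why the original `json.dumps` body with a JSON content-type drew a 400. A standard-library sketch of what goes on the wire (the attrs dict is abbreviated here):

```python
import json
from urllib.parse import urlencode

# abbreviated attrs dict, just to show the shape of the body
attrs = {"id": "infinite_scroll_1", "catnames": "reviews"}

# this is roughly the body FormRequest builds from its formdata argument
body = urlencode({
    'page': '2',
    'action': 'face3_infinite_scroll',
    'attrs': json.dumps(attrs),  # the attrs dict is itself JSON-encoded, then URL-escaped
})
print(body)
```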

By the way, your parsing part will not work as-is, because you need to handle a JSON response (and parse the "html" part out of it).

UPDATE: everything works on my side (the HTML contains the list of movies):

2020-05-16 00:20:23 [scrapy.core.engine] INFO: Spider opened
2020-05-16 00:20:23 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-05-16 00:20:23 [wegotthiscovered] INFO: Spider opened: wegotthiscovered
2020-05-16 00:20:23 [wegotthiscovered] INFO: Spider opened: wegotthiscovered
2020-05-16 00:20:23 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-05-16 00:20:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://wegotthiscovered.com/reviews/> (referer: None)
2020-05-16 00:20:30 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://wegotthiscovered.com/wp-admin/admin-ajax.php> (referer: https://wegotthiscovered.com/reviews/)

Either your IP is banned, or you are not running my code.

answered 2020-05-13T11:11:14.303