scrapy - Scrapy：（400 Bad Request）HTTP状态码未处理或不允许

Question

因为我是python和scrapy的新手。我一直在尝试抓取一个 URL 碎片化的网站。我正在发出发布请求以获取响应，但不幸的是它没有让我得到结果。

    def start_requests(self):
    try:
        form = {'menu': '6'
            , 'browseby': '8'
            , 'sortby': '2'
            , 'media': '3'
            , 'ce_id': '1428'
            , 'ot_id': '19999'
            , 'marker': '354'
            , 'getpage': '1'}

        head = {
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
            # 'Content-Length': '78',
            # 'Host': 'onlinelibrary.ectrims-congress.eu',
            # 'Accept-Encoding': 'gzip, deflate, br',
            # 'Connection': 'keep-alive',
            'XMLHttpRequest':'XMLHttpRequest',
        }

        urls = [
            'https://onlinelibrary.ectrims-congress.eu/ectrims/listing/conferences'
            ]

        request_body = urllib.parse.urlencode(form)
        print(request_body)
        print(type(request_body))

        for url in urls:
            req = Request(url=url, body= request_body, method='POST', headers=head,callback=self.parse)
            req.headers['Cookie'] = 'js_enabled=true; is_cookie_active=true;'

            yield req

    except Exception as e:
        print('the error is {}'.format(e))

我收到一个不断的错误

[scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <POST https://onlinelibrary.ectrims-congress.eu/ectrims/listing/conferences> (failed 4 times): 400 Bad Request

当我试图让邮递员检查相同的内容时，我得到了预期的输出。有人可以帮我解决这个问题。

score 0 · Accepted Answer

尝试使用FormRequest而不是Request.

https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.FormRequest

score 0 · Accepted Answer

如果你想使用Request来发送POST请求，你将不得不使用json.dumps()来转换dictionary为string.

这是一个有效的解决方案：

import scrapy
    
class EventsSpider(scrapy.Spider):
    name = 'events'

    def start_requests(self):
        form = {'menu': '6', 'browseby': '8', 'sortby': '2', 'media': '3', 'ce_id': '1428', 'ot_id': '19999', 'marker': '354', 'getpage': '1'}

        head = {
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
            'XMLHttpRequest': 'XMLHttpRequest',
        }

        url = 'https://onlinelibrary.ectrims-congress.eu/ectrims/listing/conferences'
        request_body = json.dumps(form)
        req = scrapy.Request(url=url, body=request_body, method='POST', headers=head, callback=self.parse)
        yield req

    def parse(self, response):
        print(response.json().keys())

输出：

dict_keys(['html', 'type', 'debug', 'total_pages', 'current_page', 'total_items', 'login'])

额外提示：如果您可以在 Postman 中使用它，您可以单击</>右侧面板上的代码按钮。requests如果您选择 Python，您将使用库生成代码。

scrapy - Scrapy：（400 Bad Request）HTTP状态码未处理或不允许

2 回答 2

Related

Reference