
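When I request multiple URLs, the first few responses come back fine, and then the server starts returning 403 errors for the remaining URLs.

I tried setting a User-Agent and using proxies, but the problem persists. I also tried a 0.5-second delay between requests.

I am using requests version 2.22.0.

This is what it looks like:

This is what (r.status_code, r.headers, r.text) looks like: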

403 {'Allow': 'GET, POST, HEAD, PUT, PATCH, DELETE, OPTIONS', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html; charset=UTF-8', 'Accept-Ranges': 'bytes, bytes, bytes, bytes', 'Content-Length': '1519', 'Date': 'Thu, 06 Feb 2020 10:34:40 GMT', 'Connection': 'keep-alive', 'set-cookie': 'machine_cookie=9581501972230; expires=Wed, 05 Feb 2025 10:34:40 GMT; path=/;', 'X-Served-By': 'cache-sea4466-SEA, cache-maa18327-MAA', 'X-Cache': 'MISS, MISS', 'X-Cache-Hits': '0, 0', 'X-Timer': 'S1580985280.913451,VS0,VE312', 'Vary': 'User-Agent, Accept-Encoding'} <!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title>Access to this page has been denied.</title>
  <link href="https://fonts.googleapis.com/css?family=Open+Sans:300" rel="stylesheet">
  <style>
    html, body {
      margin: 0;
      padding: 0;
      font-family: 'Open Sans', sans-serif;
      color: #000;
    }

    .container {
      align-items: center;
      display: flex;
      flex: 1;
      justify-content: space-between;
      flex-direction: column;
      height: 100%;
    }

    .container > div {
      width: 100%;
      display: flex;
      justify-content: center;
    }

    .container > div > div {
      display: flex;
      width: 80%;
    }

    .customer-logo-wrapper {
      padding-top: 2rem;
      flex-grow: 0;
      background-color: #fff;
    }

    .customer-logo {
      border-bottom: 1px solid #000;
    }

    .customer-logo > img {
      padding-bottom: 1rem;
      max-height: 50px;
      max-width: 100%;
    }

    .page-title-wrapper {
      flex-grow: 0;  /* was 2, but that pushed it too far down the page */
    }

    .page-title {
      flex-direction: column-reverse;
    }

    .content-wrapper {
      flex-grow: 5;
    }

    .content {
      flex-direction: column;
    }

    @media (min-width: 768px) {
      html, body {
        height: 100%;
      }
    }
  </style>
  <script>
    window._pxAppId = 'PXxgCxM9By';
    window._pxJsClientSrc = '/xgCxM9By/init.js';
    window._pxHostUrl = '/xgCxM9By/xhr';

    startTime = Date.now();
    window._pxOnCaptchaSuccess = function(isValid){
      var solutionTime = Math.floor((Date.now() - startTime) / 1000);
      var reload = function(){ top.location.reload(); };
      sendEvent("captcha/solved?px_uuid=" + window._pxUuid + "&time_to_solution=" + solutionTime + '&isValid=' + isValid, reload);
      setTimeout(reload, 700);
    };

    function sendEvent(event, onload){
      var xhr = new XMLHttpRequest();
      xhr.open("GET", "/_sa_track/" + event);
      if (onload) xhr.addEventListener("load", onload);
      xhr.send();
    }
  </script>
<script type="text/javascript">window._pxVid = "";window._pxUuid = "47a70d80-48cc-11ea-860b-c96869955a6b";</script></head>
<body>
<section class="container">
  <div class="page-title-wrapper">
    <div class="page-title">
      <h1>Please click “I am not a robot” to continue</h1>
    </div>
  </div>
  <div class="content-wrapper">
    <div class="content">
      <div id="px-captcha"></div>
      <p></p>
      <p>
        To ensure this doesn’t happen in the future, please enable Javascript and cookies in your browser.<br/>
        Is this happening to you frequently? Please <a href="https://seekingalpha.userecho.com?source=captcha">report it on our feedback forum</a>.
      </p>
      <p>
        If you have an ad-blocker enabled you may be blocked from proceeding. Please disable your ad-blocker and refresh.
      </p>
      <p>Reference ID: <span id="refid"></span></p>
    </div>
  </div>
  <script>
    document.getElementById("refid").innerHTML = window._pxUuid;
    sendEvent("captcha/shown?px_uuid=" + window._pxUuid);
  </script>
</section>

<script src="/xgCxM9By/captcha/PXxgCxM9By/captcha.js?a=c&m=0"></script>

</body>
</html>

1 Answer


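The server is keeping you from the information you want by returning HTTP status code 403 Forbidden together with a CAPTCHA, which it uses to check that the request comes from a human and not from a Python script. Most likely the remote service has temporarily banned your session or your IP address.

There are some workarounds for this kind of server-side ban, but there is no guarantee that you can get past the restriction.

So I can only give you some suggestions:

  1. Prefer a Session over one-off requests, because it keeps state (such as cookies) between requests.
  2. Send a User-Agent that looks like a real browser's.
  3. Moderately increase the cooldown between requests.
  4. Proxies can also be banned by the remote server (usually by IP address), so it is sometimes a good idea to rotate through several proxies.
  5. Your main goal is to make your requests look like those of an ordinary browser. You can inspect the requests your browser sends to the remote server in the Network tab of the developer tools. Try to replicate the browser's behavior.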
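
The suggestions above could be combined roughly as follows. This is only a minimal sketch, not a guaranteed way around the block: the helper names `make_session` and `fetch_all`, the header values, the 2-second default delay, and the decision to stop at the first 403 are all assumptions made for illustration. Copy the real header values from your own browser.

```python
import itertools
import time

# Hypothetical browser-like headers; replace them with the values your own
# browser sends (visible in the Network tab of the developer tools).
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}


def make_session(headers=BROWSER_HEADERS):
    """Create a requests.Session that keeps cookies and sends the given headers."""
    # Imported here so fetch_all below can be used with any session-like object.
    import requests  # third-party: pip install requests

    session = requests.Session()
    session.headers.update(headers)
    return session


def fetch_all(session, urls, proxies=(), delay=2.0, sleep=time.sleep):
    """Fetch URLs one by one, pausing between requests and rotating proxies.

    Returns a list of (url, status_code) pairs. Stops at the first 403,
    because further requests would most likely be blocked as well.
    """
    proxy_pool = itertools.cycle(proxies) if proxies else None
    results = []
    for index, url in enumerate(urls):
        if index:
            sleep(delay)  # cooldown before every request except the first
        kwargs = {"timeout": 10}
        if proxy_pool is not None:
            proxy = next(proxy_pool)  # round-robin over the proxy list
            kwargs["proxies"] = {"http": proxy, "https": proxy}
        response = session.get(url, **kwargs)
        results.append((url, response.status_code))
        if response.status_code == 403:
            break
    return results
```

A call might look like `fetch_all(make_session(), urls, proxies=my_proxies, delay=2.0)`, where `urls` and `my_proxies` are your own lists. If the site still answers with the CAPTCHA page, the block is probably based on something this sketch does not cover, such as JavaScript checks in the browser.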
answered 2020-02-06T12:26:29.473