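I am trying to scrape a website that requires a login, where the core data is loaded by JavaScript through XHR requests. I am using the requests-html library, but its render() function seems to have no effect on the page. Here is my code: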

import requests_html as requests
import bs4 as bs

# variables...

# def createForm()...

with requests.HTMLSession() as session:
    print("retrieving page...")
    initial_response = session.get(login_url)

    print("logging in...")
    response = session.post(url=login_url, data=createForm(initial_response))
    page_html = session.get(target_url)

    # snapshot of the page before rendering
    page = bs.BeautifulSoup(page_html.content, 'lxml')
    html_before = page.prettify()

    print('rendering...')
    page_html.html.render(sleep=5)

    # snapshot of the page after rendering
    page_rendered = bs.BeautifulSoup(page_html.content, 'lxml')
    html_after = page_rendered.prettify()

    if html_before == html_after:
        print("they are the same")

This is the HTML that is returned (the important parts):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <title>
   Home | Compass
  </title>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <script src="/cdn-cgi/apps/head/nXBUbHOMoxcWCnUqQqrCuyGGJ4s.js">
  </script>

    Boring CSS...

  <meta content="text/html;charset=utf-8" http-equiv="Content-type"/>
 </head>
 <body class="greyBody">

    Dull JSON...

Compass.assemblyVersion = "11.44.1.0";Compass.isDev = false;Compass.organisationUserId = 1921;Compass.organisationUserSussiId = "SAGE.ALLEN";Compass.organisationUserBaseRole = 1;Compass.organisationUserRoles = { "AfterHoursAccess": true, "MyFilesBase": true, "StaffStudentsMisc": true, "StudentsMisc": true};Compass.schoolId = "shenton.wa.edu.au";Compass.schoolName = "Shenton College";Compass.schoolPrimaryFqdn = "shenton-wa.compass.education";Compass.headAncestorId = "shenton.wa.edu.au";Compass.hasChildOrganisations = false;Compass.isInHierarchy = false;Compass.isTargetingAncestry = false;
  </script>
  <a href="/Communicate/Documentation/Help.aspx" style="position: absolute; left: -999px">
   Help
  </a>
  <form action="./" id="aspnetForm" method="post">
   <div class="aspNetHidden">
    <input id="__EVENTTARGET" name="__EVENTTARGET" type="hidden" value=""/>
    <input id="__EVENTARGUMENT" name="__EVENTARGUMENT" type="hidden" value=""/>
    <input id="__VIEWSTATE" name="__VIEWSTATE" type="hidden" value="cUphbVG2sD46yFu7rFLU15w0eiJn+7KXkA6I6Cg/7RQ9m3rwlz5poc6KdcuOMApzHcafUPq70DbpviYl6V7vDYHgLMx23YF8OtMtdcmxVSk="/>
   </div>
   <script type="text/javascript">
    //<![CDATA[
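var theForm = document.forms['aspnetForm'];
if (!theForm) {
    theForm = document.aspnetForm;
}
function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;
        theForm.submit();
    }
}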
//]]>
   </script>
   <script src="/WebResource.axd?d=pynGkmcFUV13He1Qd6_TZMBp8pi1aG3kj_Rrf_NckYpQU5qPM8p1FZ-Rik-uln5rcqPDnR_gxYalKXvDaBNyhg2&amp;t=636165368714134089" type="text/javascript">
   </script>
   <script src="/ScriptResource.axd?d=NJmAwtEo3Ipnlaxl6CMhvk3jMxAVfdhwj8EfOKm3TxozcZHxkgtaPL9w9WaPcaq30sskp_Glm4jiP922KJP1an86NqAQUdSFO5rhKIKoAuO5v3uoNlAezbrUkCluOH1LV_F9OB_HI13vUK6I2eQlLQ80jzjIESOQbg5oZuzg3A01&amp;t=ffffffffd416f7fc" type="text/javascript">
   </script>
   <script src="/ScriptResource.axd?d=dwY9oWetJoJoVpgL6Zq8OJy60eKvb9zs3HNOFEuh2HK-a1JlTWrINdUt4GmfnVpd-vC-hGQfNOA-_hpGAIQxLJ6TRvLcoTQZ7vzC5ouXwZ7EB1Rqgo_p4dWNsoX1AAW-I0gKht_6IBwAHOTP4LV38H7v4PjwKJBs7h2NgozR47s1&amp;t=ffffffffd416f7fc" type="text/javascript">
   </script>
   <script src="https://assets.compass.education/StaticAssetsK/System/Scripts/4ed8095_javascript-resource-manager.js" type="text/javascript">
   </script>
   <script src="https://assets.compass.education/StaticAssetsK/Common/Scripts/62cdce8_utility.min.js" type="text/javascript">
   </script>
   <script src="https://assets.compass.education/StaticAssetsK/System/Scripts/7c66c7e_ravenjs-loader.min.js" type="text/javascript">
   </script>
   <script src="https://assets.compass.education/StaticAssetsK/Scripts/Lib/ext-js4.2.2/5de6c0f_ext-all.min.js" type="text/javascript">
   </script>
   <script src="https://assets.compass.education/StaticAssetsK/Scripts/Lib/ce7ba4b_jquery-1.8.3.min.js" type="text/javascript">
   </script>
   <script src="https://assets.compass.education/StaticAssetsK/Common/Scripts/81a11e3_autosuggest-widget.min.js" type="text/javascript">
   </script>
   <script src="https://assets.compass.education/StaticAssetsK/Scripts/Lib/ef94bb5_jquery-json-2.3.min.js" type="text/javascript">
   </script>
   <script src="https://assets.compass.education/StaticAssetsK/Scripts/Lib/5fee56b_jquery.elastic.min.js" type="text/javascript">
   </script>
   <script src="https://assets.compass.education/StaticAssetsK/Scripts/Lib/b9aa653_jquery.simplemodal.1.4.3.min.js" type="text/javascript">
   </script>
   <script src="https://assets.compass.education/StaticAssetsK/Scripts/Lib/moment/cdeefcf_moment-and-data.min.js" type="text/javascript">
   </script>
   <script src="https://assets.compass.education/StaticAssetsK/Scripts/Lib/ext-js4.2.2/resources/js/8f9f704_ext-extensions-and-theme.min.js" type="text/javascript">
   </script>
   <script src="https://assets.compass.education/StaticAssetsK/Common/Scripts/d3ac6df_impersonate-widget.min.js" type="text/javascript">
   </script>
   <script src="https://assets.compass.education/StaticAssetsK/Common/Scripts/0d786b6_compass.min.js" type="text/javascript">
   </script>
   <script src="https://assets.compass.education/StaticAssetsK/System/Scripts/bb8b963_request-capture.js" type="text/javascript">
   </script>
   <script src="https://assets.compass.education/StaticAssetsK/System/Scripts/17975e6_external-resource-monitor.js" type="text/javascript">
   </script>
   <script src="https://assets.compass.education/StaticAssetsK/Scripts/Lib/ckeditor/0d3caaa_ckeditor.js" type="text/javascript">
   </script>
   <script src="https://assets.compass.education/StaticAssetsK/Calendar/Scripts/625070f_calendar-and-extensions.min.js" type="text/javascript">
   </script>
   <script src="https://assets.compass.education/StaticAssetsK/PageScripts/583fb06_HomePage.Chronicle.min.js" type="text/javascript">
   </script>
   <script src="https://assets.compass.education/StaticAssetsK/PageScripts/9571a67_HomePage.min.js" type="text/javascript">
   </script>
   <script type="text/javascript">
    //<![CDATA[
Sys.WebForms.PageRequestManager._initialize('ctl00$ctl04', 'aspnetForm', [], [], [], 90, 'ctl00');
//]]>

       </script>
       <script type="text/javascript">
    });ersonateWindow.show();xt.create('Compass.widgets.ImpersonateWidget', 

{});, function(e) {ggestions?sessionstate=readonly", 

Unremarkable HTML...

Encrypted:6SeqWGbwjzN6ZfMnVVAU1sXLmHGC06o6K+A6lAhEBFCLQgeQq6ZU810mqSzy0zNyMwUhKnrAlYfvvlTuy5xpIj4OkW4pGBLFN6PVai3RoevYQkgbvy9vqVBanzrNVfRGsMIE8kgq+8pJGtNiCveqQAvzLfhgHhm5QQ8/k4ShskzjZdRPX9MUNpa-->kHWHQOxCM73dFIgYrWM6PexC+wA31RdtyPTEp7gRCb7ulIlQFSKresH2xPmdHNeLhA7mCefNrbBDMG7eJ5kqhLsh3QqbxMQ1IABdA42nGGSdw1GFkmRJYS06mNS4Cjp44cmQBt
           <script src="https://assets.compass.education/StaticAssetsK/Scripts/Lib/e0c3e6b_LazyLoad.min.js" type="text/javascript">
           </script>
           <script type="text/javascript">
           </script>
           <div class="aspNetHidden">
            <input id="__VIEWSTATEGENERATOR" name="__VIEWSTATEGENERATOR" type="hidden" value="CA0B0334"/>
           </div>
          </form>
         </body>
        </html>

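I have not managed to decipher all of the scripts, since I have no experience with JavaScript, although they appear to be fetching the data. Any explanation of why these scripts are not running, or any alternative solution (as long as it is fast enough), would be appreciated.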
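One thing that may be worth checking, offered as a minimal sketch rather than a confirmed diagnosis: in requests-html, render() stores its result on the response's html object (readable as response.html.html), while response.content keeps the bytes of the original HTTP response. Comparing two soups that are both built from page_html.content would therefore always report "they are the same", whether or not the scripts ran. The helper name markup_before_and_after below is hypothetical and not part of any library; it only assumes a response object returned by an HTMLSession.

```python
def markup_before_and_after(response, sleep=5):
    """Return (raw, rendered) markup for a requests_html response.

    Hypothetical helper: `response` is assumed to be the object returned
    by HTMLSession.get(), i.e. it has .content, .encoding and .html.
    """
    # Raw markup exactly as the server sent it; render() does not change this.
    raw = response.content.decode(response.encoding or "utf-8", errors="replace")

    # Run the page's JavaScript in the headless browser.
    response.html.render(sleep=sleep)

    # The rendered markup lives on the HTML object, not on response.content.
    rendered = response.html.html
    return raw, rendered
```

With the code from the question, this would be called as raw, rendered = markup_before_and_after(page_html), and each string could then be passed to bs.BeautifulSoup(..., 'lxml') for comparison. If the two still match, the cause lies elsewhere, for example in how the headless browser loads the page.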
