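I am trying to scrape a website that requires a login, where the core data is rendered by JavaScript and XHR requests. I am using the requests-html library, but the render() function does not seem to have any effect on the page. Here is my code: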
import requests_html as requests
import bs4 as bs

# variables...
# def createForm()...

with requests.HTMLSession() as session:
    print("retrieving page...")
    initial_response = session.get(login_url)

    print("logging in...")
    response = session.post(url=login_url, data=createForm(initial_response))

    page_html = session.get(target_url)
    page = bs.BeautifulSoup(page_html.content, 'lxml')
    html_before = page.prettify()

    print('rendering...')
    page_html.html.render(sleep=5)
    page_rendered = bs.BeautifulSoup(page_html.content, 'lxml')
    html_after = page_rendered.prettify()

    if html_before == html_after:
        print("they are the same")
Here is the HTML that is returned (the important bits):
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>
Home | Compass
</title>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<script src="/cdn-cgi/apps/head/nXBUbHOMoxcWCnUqQqrCuyGGJ4s.js">
</script>
Boring CSS...
<meta content="text/html;charset=utf-8" http-equiv="Content-type"/>
</head>
<body class="greyBody">
Dull JSON...
Compass.assemblyVersion = "11.44.1.0";Compass.isDev = false;Compass.organisationUserId = 1921;Compass.organisationUserSussiId = "SAGE.ALLEN";Compass.organisationUserBaseRole = 1;Compass.organisationUserRoles = { "AfterHoursAccess": true, "MyFilesBase": true, "StaffStudentsMisc": true, "StudentsMisc": true};Compass.schoolId = "shenton.wa.edu.au";Compass.schoolName = "Shenton College";Compass.schoolPrimaryFqdn = "shenton-wa.compass.education";Compass.headAncestorId = "shenton.wa.edu.au";Compass.hasChildOrganisations = false;Compass.isInHierarchy = false;Compass.isTargetingAncestry = false;
</script>
<a href="/Communicate/Documentation/Help.aspx" style="position: absolute; left: -999px">
Help
</a>
<form action="./" id="aspnetForm" method="post">
<div class="aspNetHidden">
<input id="__EVENTTARGET" name="__EVENTTARGET" type="hidden" value=""/>
<input id="__EVENTARGUMENT" name="__EVENTARGUMENT" type="hidden" value=""/>
<input id="__VIEWSTATE" name="__VIEWSTATE" type="hidden" value="cUphbVG2sD46yFu7rFLU15w0eiJn+7KXkA6I6Cg/7RQ9m3rwlz5poc6KdcuOMApzHcafUPq70DbpviYl6V7vDYHgLMx23YF8OtMtdcmxVSk="/>
</div>
<script type="text/javascript">
//<![CDATA[
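var theForm = document.forms['aspnetForm'];
if (!theForm) {
    theForm = document.aspnetForm;
}
function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;
        theForm.submit();
    }
}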
//]]>
</script>
<script src="/WebResource.axd?d=pynGkmcFUV13He1Qd6_TZMBp8pi1aG3kj_Rrf_NckYpQU5qPM8p1FZ-Rik-uln5rcqPDnR_gxYalKXvDaBNyhg2&t=636165368714134089" type="text/javascript">
</script>
<script src="/ScriptResource.axd?d=NJmAwtEo3Ipnlaxl6CMhvk3jMxAVfdhwj8EfOKm3TxozcZHxkgtaPL9w9WaPcaq30sskp_Glm4jiP922KJP1an86NqAQUdSFO5rhKIKoAuO5v3uoNlAezbrUkCluOH1LV_F9OB_HI13vUK6I2eQlLQ80jzjIESOQbg5oZuzg3A01&t=ffffffffd416f7fc" type="text/javascript">
</script>
<script src="/ScriptResource.axd?d=dwY9oWetJoJoVpgL6Zq8OJy60eKvb9zs3HNOFEuh2HK-a1JlTWrINdUt4GmfnVpd-vC-hGQfNOA-_hpGAIQxLJ6TRvLcoTQZ7vzC5ouXwZ7EB1Rqgo_p4dWNsoX1AAW-I0gKht_6IBwAHOTP4LV38H7v4PjwKJBs7h2NgozR47s1&t=ffffffffd416f7fc" type="text/javascript">
</script>
<script src="https://assets.compass.education/StaticAssetsK/System/Scripts/4ed8095_javascript-resource-manager.js" type="text/javascript">
</script>
<script src="https://assets.compass.education/StaticAssetsK/Common/Scripts/62cdce8_utility.min.js" type="text/javascript">
</script>
<script src="https://assets.compass.education/StaticAssetsK/System/Scripts/7c66c7e_ravenjs-loader.min.js" type="text/javascript">
</script>
<script src="https://assets.compass.education/StaticAssetsK/Scripts/Lib/ext-js4.2.2/5de6c0f_ext-all.min.js" type="text/javascript">
</script>
<script src="https://assets.compass.education/StaticAssetsK/Scripts/Lib/ce7ba4b_jquery-1.8.3.min.js" type="text/javascript">
</script>
<script src="https://assets.compass.education/StaticAssetsK/Common/Scripts/81a11e3_autosuggest-widget.min.js" type="text/javascript">
</script>
<script src="https://assets.compass.education/StaticAssetsK/Scripts/Lib/ef94bb5_jquery-json-2.3.min.js" type="text/javascript">
</script>
<script src="https://assets.compass.education/StaticAssetsK/Scripts/Lib/5fee56b_jquery.elastic.min.js" type="text/javascript">
</script>
<script src="https://assets.compass.education/StaticAssetsK/Scripts/Lib/b9aa653_jquery.simplemodal.1.4.3.min.js" type="text/javascript">
</script>
<script src="https://assets.compass.education/StaticAssetsK/Scripts/Lib/moment/cdeefcf_moment-and-data.min.js" type="text/javascript">
</script>
<script src="https://assets.compass.education/StaticAssetsK/Scripts/Lib/ext-js4.2.2/resources/js/8f9f704_ext-extensions-and-theme.min.js" type="text/javascript">
</script>
<script src="https://assets.compass.education/StaticAssetsK/Common/Scripts/d3ac6df_impersonate-widget.min.js" type="text/javascript">
</script>
<script src="https://assets.compass.education/StaticAssetsK/Common/Scripts/0d786b6_compass.min.js" type="text/javascript">
</script>
<script src="https://assets.compass.education/StaticAssetsK/System/Scripts/bb8b963_request-capture.js" type="text/javascript">
</script>
<script src="https://assets.compass.education/StaticAssetsK/System/Scripts/17975e6_external-resource-monitor.js" type="text/javascript">
</script>
<script src="https://assets.compass.education/StaticAssetsK/Scripts/Lib/ckeditor/0d3caaa_ckeditor.js" type="text/javascript">
</script>
<script src="https://assets.compass.education/StaticAssetsK/Calendar/Scripts/625070f_calendar-and-extensions.min.js" type="text/javascript">
</script>
<script src="https://assets.compass.education/StaticAssetsK/PageScripts/583fb06_HomePage.Chronicle.min.js" type="text/javascript">
</script>
<script src="https://assets.compass.education/StaticAssetsK/PageScripts/9571a67_HomePage.min.js" type="text/javascript">
</script>
<script type="text/javascript">
//<![CDATA[
Sys.WebForms.PageRequestManager._initialize('ctl00$ctl04', 'aspnetForm', [], [], [], 90, 'ctl00');
//]]>
</script>
<script type="text/javascript">
});ersonateWindow.show();xt.create('Compass.widgets.ImpersonateWidget',
{});, function(e) {ggestions?sessionstate=readonly",
Unremarkable HTML...
Encrypted:6SeqWGbwjzN6ZfMnVVAU1sXLmHGC06o6K+A6lAhEBFCLQgeQq6ZU810mqSzy0zNyMwUhKnrAlYfvvlTuy5xpIj4OkW4pGBLFN6PVai3RoevYQkgbvy9vqVBanzrNVfRGsMIE8kgq+8pJGtNiCveqQAvzLfhgHhm5QQ8/k4ShskzjZdRPX9MUNpa-->kHWHQOxCM73dFIgYrWM6PexC+wA31RdtyPTEp7gRCb7ulIlQFSKresH2xPmdHNeLhA7mCefNrbBDMG7eJ5kqhLsh3QqbxMQ1IABdA42nGGSdw1GFkmRJYS06mNS4Cjp44cmQBt
<script src="https://assets.compass.education/StaticAssetsK/Scripts/Lib/e0c3e6b_LazyLoad.min.js" type="text/javascript">
</script>
<script type="text/javascript">
</script>
<div class="aspNetHidden">
<input id="__VIEWSTATEGENERATOR" name="__VIEWSTATEGENERATOR" type="hidden" value="CA0B0334"/>
</div>
</form>
</body>
</html>
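I have not managed to decipher all of the scripts because I have no experience with JavaScript, although they appear to be fetching the data. Any explanation of why these scripts are not running, or any alternative solution (that is fast enough), would be appreciated.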
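
One thing that may be worth checking in the comparison itself: as far as I understand requests-html, render() updates the response's .html object in place, while .content stays the raw bytes of the original HTTP response. If that is right, comparing .content before and after rendering would always report "the same", whether or not the scripts ran. The sketch below reads the rendered markup from .html.html instead. The helper name html_before_and_after is hypothetical and not part of the library, and this is only a guess at one cause, not a confirmed diagnosis.

```python
def html_before_and_after(response, sleep=5):
    """Return (raw, rendered) markup for a requests-html response.

    Assumption: render() rewrites response.html in place and leaves
    response.content (the bytes of the original HTTP response) untouched.
    """
    # Markup exactly as the server sent it, before any JavaScript runs.
    encoding = response.encoding or "utf-8"
    raw = response.content.decode(encoding, errors="replace")

    # Run the page's scripts in headless Chromium; this mutates response.html.
    response.html.render(sleep=sleep)

    # The post-render markup lives on the HTML object, not on .content.
    rendered = response.html.html
    return raw, rendered
```

With the session from the code above, this could be called as raw, rendered = html_before_and_after(session.get(target_url)), and rendered passed to BeautifulSoup in place of page_html.content.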