Python code fails at page = webclient.getPage("https://www.gartner.com/en/newsroom")
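I modified the gartner.py script from http://blog.databigbang.com/web-scraping-ajax-and-javascript-sites/, just to quickly adapt the Jython code to the current version of the website (the XPaths worked when I tested them in Selenium):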
from com.gargoylesoftware.htmlunit import WebClient as WebClient
from com.gargoylesoftware.htmlunit import BrowserVersion as BrowserVersion

def main():
    webclient = WebClient(BrowserVersion.BEST_SUPPORTED) # creating a new webclient object.
    page = webclient.getPage("https://www.gartner.com/en/newsroom") # getting the url
    articles = page.getByXPath("//div[@class='row newsletter']//a") # getting all the hyperlinks
    for article in articles:
        print ("Clicking on:", article)
        subpage = article.click() # click on the article link
        title = subpage.getByXPath("//div[@class='globalsite cmp-globalsite-columncontrol aem-GridColumn aem-GridColumn--default--12']//*[@class='grid-norm mg-t0']") # get title
        summary = subpage.getByXPath("//div[@class='globalsite cmp-globalsite-columncontrol aem-GridColumn aem-GridColumn--default--12']//*[@class='grid-norm subtitle mg-t15 mg-b15']") # get summary
        print(title)
        print(summary)

if __name__ == '__main__':
    main()
This is what I get:
C:\Users\xyz>jython C:\Users\xyz\Desktop\gartner2.py
com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl WARNING Obsolete content type encountered: 'text/javascript'.
...some CSS errors...
com.gargoylesoftware.htmlunit.javascript.DefaultJavaScriptErrorListener SEVERE Error during JavaScript execution
Traceback (most recent call last):
File "C:\Users\xyz\Desktop\gartner2.py", line 30, in <module>
main()
File "C:\Users\xyz\Desktop\gartner2.py", line 17, in main
page = webclient.getPage("https://www.gartner.com/en/newsroom") # getting the url
Exception class=[net.sourceforge.htmlunit.corejs.javascript.JavaScriptException]
com.gargoylesoftware.htmlunit.ScriptException: TypeError: Cannot call method "then" of undefined (https://www.gartner.com/en/ruxitagentjs_ICA2SVfqru_10137171222133618.js#163)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:891)
....
com.gargoylesoftware.htmlunit.ScriptException: com.gargoylesoftware.htmlunit.ScriptException: TypeError: Cannot call method "then" of undefined (https://www.gartner.com/en/ruxitagentjs_ICA2SVfqru_10137171222133618.js#163)