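I'm new to web scraping and I'm hoping someone can shed some light on my problem. I've found several posts about this issue, but I can't seem to get any of them to work for me. The closest tutorial I followed is this one: How to scrape a website that requires login first with Python
I'm trying to scrape the following site: http://amigobulls.com/stocks/GE/income-statement/quarterly
My goal is to scrape the download link labelled "Download General Electric financial statements". That link is only available after logging in, and I can't get the login part to work.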
import cookielib
import mechanize

br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Chrome')]
br.open('http://amigobulls.com/stocks/GE/income-statement/quarterly')
# Print every form on the page before filling anything in
for f in br.forms():
    print f
# User credentials
br.select_form(nr=0)  # first form on the page
br.form['pass'] = '______'
br.select_form(nr=1)  # second form on the page
br.form['name'] = '______'
# Print the forms again to see which fields were filled in
for f in br.forms():
    print f
# Login (submits the currently selected form)
br.submit()
print br.open('http://amigobulls.com/stocks/GE/income-statement/quarterly').read()
The response I get is the following:
<GET http://amigobulls.com/stocks/GE/income-statement/quarterly# application/x-www-form-urlencoded
<TextControl(<None>=)>
<PasswordControl(pass=)>
<CheckboxControl(remember_me=[*1])>
<SubmitControl(<None>=Login) (readonly)>
<TextControl(<None>=)>>
<GET http://amigobulls.com/stocks/GE/income-statement/quarterly# application/x-www-form-urlencoded
<TextControl(name=)>
<PasswordControl(<None>=)>
<PasswordControl(<None>=)>
<SubmitControl(<None>=Join Us) (readonly)>>
<GET http://amigobulls.com/stocks/GE/income-statement/quarterly# application/x-www-form-urlencoded
<TextControl(<None>=)>
<PasswordControl(pass=______)>
<CheckboxControl(remember_me=[*1])>
<SubmitControl(<None>=Login) (readonly)>
<TextControl(<None>=)>>
<GET http://amigobulls.com/stocks/GE/income-statement/quarterly# application/x-www-form-urlencoded
<TextControl(name=______)>
<PasswordControl(<None>=)>
<PasswordControl(<None>=)>
<SubmitControl(<None>=Join Us) (readonly)>>
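This is followed by the HTML of the site in its logged-out state.
If the login had succeeded, I should be able to find the download link in that HTML.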
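To make clear what I mean by "find the download link", here is a minimal sketch of the step I would run on the page once I am logged in. It is not something I have run against the real site: it uses Python 3's standard `html.parser` (my script above is Python 2), and the `LinkFinder` class, the `find_link` helper, and the sample markup with its `/files/ge.xls` href are all hypothetical stand-ins, since I don't know what the logged-in HTML actually looks like.

```python
from html.parser import HTMLParser


class LinkFinder(HTMLParser):
    """Collect (text, href) pairs for every <a href=...> element in a page."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (text, href) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...>
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def find_link(html, phrase):
    """Return the href of the first link whose text contains phrase, else None."""
    finder = LinkFinder()
    finder.feed(html)
    for text, href in finder.links:
        if phrase.lower() in text.lower():
            return href
    return None


# Hypothetical markup standing in for the logged-in page.
sample = (
    '<p><a href="/about">About</a> '
    '<a href="/files/ge.xls">Download GE financial statements</a></p>'
)
print(find_link(sample, "download"))
```

In the real script, the string passed to `find_link` would be the HTML returned by the final `br.open(...).read()` call, assuming the session is actually logged in at that point.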
Can anyone help? Thanks a lot!