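I am trying to extract a table from this website.
The page is in Chinese, but basically you enter your login details in the boxes above the big blue button in the middle of the page. After you log in, the table appears in the middle of the page. Note: on /articlenew.html, logging in requires only USERNAME and PASSWORD. Nothing else.
After authentication, the headers look like this: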
Request URL:http://www.sxcoal.com/user/login.aspx
Request Method:POST
Status Code:302 Found
Request Headers
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip,deflate,sdch
Accept-Language:en,en-GB;q=0.8,zh;q=0.6,zh-CN;q=0.4
Connection:keep-alive
Content-Length:39
Content-Type:application/x-www-form-urlencoded
Cookie:the_cookies
Host:www.sxcoal.com
Origin:http://www.sxcoal.com
Referer:http://www.sxcoal.com/coal/3478186/articlenew.html
User-Agent:Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36
Form Data
username:myusername
password:mypassword
Response Headers
Cache-Control:private
Content-Length:167
Content-Type:text/html; charset=gb2312
Date:Thu, 14 Nov 2013 01:06:00 GMT
Location:http://www.sxcoal.com/coal/3478186/articlenew.html
Server:Microsoft-IIS/7.0
Set-Cookie:s_info=zhuhaiqinfa|15816; domain=sxcoal.com; path=/
X-AspNet-Version:2.0.50727
X-Powered-By:ASP.NET
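For reference, the capture above can be replayed from R roughly as follows. This is only a sketch under assumptions: the field names `username` and `password` and the Referer are taken from the capture, while the helper names `build_embedded_login` and `login_embedded` are hypothetical, and whether the server accepts this request outside a browser is untested.

```r
# Form body as seen in the captured Form Data (two fields only).
build_embedded_login <- function(user, pass) {
  list(username = user, password = pass)
}

# Replay the embedded login. Assumption: the server only needs the session
# cookie from the article page plus a matching Referer header.
login_embedded <- function(user, pass,
                           referer = "http://www.sxcoal.com/coal/3478186/articlenew.html",
                           login_url = "http://www.sxcoal.com/user/login.aspx") {
  curl <- RCurl::getCurlHandle()
  # cookiefile = "" turns on in-memory cookie handling for this handle
  RCurl::curlSetOpt(cookiefile = "", followlocation = TRUE,
                    useragent = "Mozilla/5.0", referer = referer, curl = curl)
  # Load the article page first so any session cookie is set
  RCurl::getURL(referer, curl = curl)
  # style = "post" sends application/x-www-form-urlencoded, as in the capture
  RCurl::postForm(login_url, .params = build_embedded_login(user, pass),
                  style = "post", curl = curl)
  curl  # reuse this handle for later requests to stay logged in
}
```

If this works, `getURL(referer, curl = login_embedded("myusername", "mypassword"))` should return the page with the table; if it does not, the server is probably checking something the capture does not show.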
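I tried the method shown by Gergely Daróczi. However, for some reason R fails to log in. My guess is that the /login.aspx (http://www.sxcoal.com/user/login.aspx) nested in /articlenew.html actually requires more than just the username and the corresponding password. I have put the headers for /login.aspx at the end of the question.
Here is the code I used: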
library(RCurl)
mycurl <- getCurlHandle()
agent <- "Mozilla/5.0"
curlSetOpt(cookiejar = "", followlocation = TRUE, useragent = agent, autoreferer = TRUE, curl = mycurl)
html <- getURL('http://www.sxcoal.com/user/login.aspx', curl = mycurl)
viewstate <- as.character(sub('.*id="__VIEWSTATE" value="([0-9a-zA-Z+/=]*).*', '\\1', html))
eventvalidation <- as.character(sub('.*id="__EVENTVALIDATION" value="([0-9a-zA-Z+/=]*).*', '\\1', html))
##checkcode <- ??????????????? ## can't define it as it changes
params <- list(
  "txtuser" = "myusername",
  "txtpass" = "mypassword",
  "__VIEWSTATE" = viewstate,
  "__EVENTVALIDATION" = eventvalidation,
  "CheckCode" = checkcode,
  "Button2" = ""
)
html <- postForm('http://www.sxcoal.com/user/login.aspx', .params = params, curl = mycurl)
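One thing worth guarding against in the code above: `sub()` returns its input unchanged when the pattern does not match, so a missing hidden field would put the whole page into the form. A minimal sketch of a stricter extraction follows; the helper names `extract_hidden` and `require_field` are hypothetical, and it keeps the same assumption as the patterns above that `id` is directly followed by `value`.

```r
# Pull the value of a hidden <input> by its id.
# Returns NA when the field is not found instead of the whole page.
extract_hidden <- function(html, id) {
  pattern <- sprintf('id="%s"[[:space:]]+value="([^"]*)"', id)
  hit <- regmatches(html, regexec(pattern, html))[[1]]
  if (length(hit) < 2) NA_character_ else hit[2]
}

# Stop early rather than posting a bad token.
require_field <- function(html, id) {
  value <- extract_hidden(html, id)
  if (is.na(value)) stop("hidden field not found: ", id)
  value
}
```

With this, `viewstate <- require_field(html, "__VIEWSTATE")` fails loudly if the login page layout is not what the pattern expects.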
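The CheckCode is the verification code displayed in an image (http://www.sxcoal.com/CheckCode/CheckCode.aspx). Unlike __VIEWSTATE and __EVENTVALIDATION, CheckCode changes every time the page is refreshed. Things are also complicated for me because I know nothing about web development. It seems to me that the login details required by the /login.aspx nested in /articlenew.html differ from those required by /login.aspx itself. Is there any way to pin down the login details the site requires so that I do not have to deal with the random image verification code? If not, does anyone know how I could handle the verification image?
Thanks in advance.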
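To make the second option concrete, one manual way to handle the image would be to download it with the same curl handle and type the code in by hand. This is a rough sketch only: the helper names `fetch_checkcode` and `prompt_checkcode` are hypothetical, the `.gif` extension is a guess, and it assumes the server ties the code to the session cookie, which I have not confirmed.

```r
# Download the verification image using the SAME handle that fetched the
# login page, so the code belongs to the same session (assumption).
fetch_checkcode <- function(curl,
                            image_url = "http://www.sxcoal.com/CheckCode/CheckCode.aspx",
                            path = tempfile(fileext = ".gif")) {  # extension is a guess
  img <- RCurl::getBinaryURL(image_url, curl = curl)  # raw vector
  writeBin(img, path)
  path
}

# Ask the user to read the image and type the code.
# `reader` is a parameter only so the prompt can be replaced when scripting.
prompt_checkcode <- function(path, reader = readline) {
  message("Open ", path, " and type the code shown in the image.")
  trimws(reader("CheckCode: "))
}
```

The idea would be `checkcode <- prompt_checkcode(fetch_checkcode(mycurl))` right before building `params`, without requesting the login page again in between. Note that `readline()` only waits for input in an interactive session.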
Request URL:http://www.sxcoal.com/user/login.aspx
Request Method:POST
Status Code:302 Found
Request Headers
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip,deflate,sdch
Accept-Language:en,en-GB;q=0.8,zh;q=0.6,zh-CN;q=0.4
Connection:keep-alive
Content-Length:234
Content-Type:application/x-www-form-urlencoded
Cookie:the_cookies
Host:www.sxcoal.com
Origin:http://www.sxcoal.com
Referer:http://www.sxcoal.com/user/login.aspx
User-Agent:Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36
Form Data
__VIEWSTATE:whatever_it_is
txtuser:myusername
txtpass:mypassword
CheckCode:04854
Button2:
__EVENTVALIDATION:whatever_it_is_2
Response Headers
Cache-Control:private
Content-Length:170
Content-Type:text/html; charset=gb2312
Date:Thu, 14 Nov 2013 01:09:57 GMT
Location:http://www.sxcoal.com/?aspxerrorpath=/user/login.aspx
Server:Microsoft-IIS/7.0
X-AspNet-Version:2.0.50727
X-Powered-By:ASP.NET