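I'm trying to crawl a website, but I can't log on using the .NET HttpWebRequest and HttpWebResponse classes. Using a network monitor, one key difference seems to be that the login from the browser includes the payload in the POST message, whereas HttpWebRequest sends the payload in a separate message and gets a 301 response. Is there a way to make it use a single message? Or is there something else I'm missing? I have used this code for another website, where it works: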
// Send GET to logon site.
SiteRequest = (HttpWebRequest)WebRequest.Create(logonUrl);
SiteRequest.Method = "GET";
SiteRequest.AllowAutoRedirect = AllowRedirect;
SiteRequest.CookieContainer = SiteCookieContainer;
SiteRequest.Referer = logonUrl;
SiteResponse = (HttpWebResponse)SiteRequest.GetResponse();
mainStream = SiteResponse.GetResponseStream();
ReadAndIgnoreAllStreamBytes(mainStream);
mainStream.Close();
// Send POST to logon site.
SiteRequest = (HttpWebRequest)WebRequest.Create(postUrl);
SiteRequest.Method = "POST";
SiteRequest.AllowAutoRedirect = AllowRedirect;
SiteRequest.ContentType = "application/x-www-form-urlencoded";
SiteRequest.CookieContainer = SiteCookieContainer;
SiteRequest.CookieContainer.Add(SiteResponse.Cookies);
SiteRequest.Referer = postUrl;
SiteRequest.Timeout = TimeoutMsec;
buffer = Encoding.UTF8.GetBytes(logonPostData);
SiteRequest.ContentLength = buffer.Length;
postStream = SiteRequest.GetRequestStream();
postStream.Write(buffer, 0, buffer.Length);
postStream.Flush();
postStream.Close();
SiteResponse = (HttpWebResponse)SiteRequest.GetResponse();
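The snippet above does not show how `logonPostData` is built. As a minimal sketch, assuming the form fields are the `url`, `email` and `password` names that appear in the browser capture further down, the body could be assembled with percent-encoding so that characters such as `@`, `&` and `=` in the values do not corrupt it. The `FormBody` class and its `Build` method are hypothetical helpers, not part of the original code, and the sample values are placeholders:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: builds an application/x-www-form-urlencoded body.
public static class FormBody
{
    public static string Build(IEnumerable<KeyValuePair<string, string>> fields)
    {
        // Percent-encode each name and value, then join the pairs with '&'.
        // Note: Uri.EscapeDataString encodes a space as %20 rather than '+'.
        return string.Join("&", fields.Select(f =>
            Uri.EscapeDataString(f.Key) + "=" + Uri.EscapeDataString(f.Value)));
    }
}
```

For example, `FormBody.Build(new[] { new KeyValuePair<string, string>("url", ""), new KeyValuePair<string, string>("email", "user@example.com") })` yields `url=&email=user%40example.com`. The result can then be passed to `Encoding.UTF8.GetBytes` as in the code above.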
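I have the same problem using the HtmlWeb class in HtmlAgilityPack.
Thanks.
Update:
It turns out I was using the "www.example.com" form of the address rather than "example.com", hence the redirect. But with the correct address I now get a 404 "page not found" error.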
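A redirect like this can be made visible by turning off automatic redirects and reading the `Location` header. This matters for a login because RFC 7231 notes that a user agent may change POST to GET when following a 301, so a redirected POST can lose its body. The sketch below is only an illustration: `RedirectProbe` and both of its methods are hypothetical names, and `GetRedirectTarget` needs network access and will throw a `WebException` on a 4xx or 5xx response.

```csharp
using System;
using System.Net;

public static class RedirectProbe
{
    // True when the two URLs differ only by a leading "www." on the host.
    public static bool DiffersOnlyByWww(Uri a, Uri b)
    {
        string Strip(string host) =>
            host.StartsWith("www.", StringComparison.OrdinalIgnoreCase) ? host.Substring(4) : host;

        return !string.Equals(a.Host, b.Host, StringComparison.OrdinalIgnoreCase)
            && string.Equals(Strip(a.Host), Strip(b.Host), StringComparison.OrdinalIgnoreCase)
            && a.Scheme == b.Scheme
            && a.PathAndQuery == b.PathAndQuery;
    }

    // Sends one request without following redirects.
    // Returns the redirect target, or null if the response is not a 3xx with a Location header.
    public static Uri GetRedirectTarget(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.AllowAutoRedirect = false;
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            int status = (int)response.StatusCode;
            string location = response.Headers["Location"];
            if (status < 300 || status >= 400 || location == null) return null;
            // Location may be relative, so resolve it against the requested URL.
            return new Uri(new Uri(url), location);
        }
    }
}
```

If `GetRedirectTarget(postUrl)` returns a URL for which `DiffersOnlyByWww` is true, posting directly to that URL avoids the redirect altogether.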
Here is what the browser sends for the POST:
- Http: Request, POST /accounts/signin
Command: POST
+ URI: /accounts/signin
ProtocolVersion: HTTP/1.1
Accept: text/html, application/xhtml+xml, */*
Referer: http://***.com/accounts/signin
Accept-Language: en-US,en;q=0.8,zh-Hans-CN;q=0.7,zh-Hans;q=0.5,zh-Hant-TW;q=0.3,zh-Hant;q=0.2
UserAgent: Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0; Touch)
+ ContentType: application/x-www-form-urlencoded
Accept-Encoding: gzip, deflate
Host: example.com
ContentLength: 67
DNT: 1
Connection: Keep-Alive
Cache-Control: no-cache
- Cookie: PHPSESSID=169***efe; lang=en_US; cart=eyJ***wfQ%3D%3D; cartitems=W10%3D; __utma=***; __utmb=***; __utmc=**; __utmz=**
PHPSESSID: 169***efe
lang: en_US
cart: eyJ***wfQ%3D%3D
cartitems: W10%3D
__utma: ***
__utmb: ***
__utmc: ***
__utmz: ***
HeaderEnd: CRLF
- payload: HttpContentType = application/x-www-form-urlencoded
url:
email: ***
password: ***
Here is what I'm sending:
(POST:)
- Http: Request, POST /accounts/signin
Command: POST
+ URI: /accounts/signin
ProtocolVersion: HTTP/1.1
+ ContentType: application/x-www-form-urlencoded
Accept: text/html, application/xhtml+xml, */*
Accept-Language: en-US,en;q=0.8,zh-Hans-CN;q=0.7,zh-Hans;q=0.5,zh-Hant-TW;q=0.3,zh-Hant;q=0.2
Accept-Encoding: gzip, deflate
DNT: 1
Cache-Control: no-cache
Referer: http://***.com/accounts/signin
Host: chinesepod.com
- Cookie: lang=en_US; cart=eyJ***jowfQ%3D%3D; cartitems=W10%3D; PHPSESSID=944***3e7
lang: en_US
cart: eyJ***wfQ%3D%3D
cartitems: W10%3D
PHPSESSID: 944***3e7
ContentLength: 61
HeaderEnd: CRLF
(Separate payload:)
- Http: HTTP Payload, URL: /accounts/signin
- payload: HttpContentType = application/x-www-form-urlencoded
url:
email: ***
password: ***
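The browser version has these __utXX cookies; I assume the browser adds those as some kind of tag, right? Otherwise, assuming cookie order doesn't matter, the key difference is that the payload is sent separately. Can you see anything else amiss?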
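One way to look for other differences is to compare the header names of the two captures mechanically instead of by eye. The sketch below is a hypothetical helper (`HeaderDiff.MissingFrom` is not from the original code) and uses the names exactly as the network monitor displays them, for example `UserAgent` rather than `User-Agent`:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HeaderDiff
{
    // Returns the names present in `reference` but absent from `candidate`,
    // ignoring case and keeping the order of `reference`.
    public static List<string> MissingFrom(IEnumerable<string> reference, IEnumerable<string> candidate)
    {
        var seen = new HashSet<string>(candidate, StringComparer.OrdinalIgnoreCase);
        return reference.Where(name => !seen.Contains(name)).ToList();
    }
}
```

Fed with the header names from the two captures above, this reports `UserAgent` and `Connection` as present only in the browser request. The same function applied to the cookie names reports the four `__utm*` cookies. The captures also show a ContentLength of 67 for the browser and 61 for the code, so the two bodies are not byte-identical either.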
Thanks.
-John