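I'm trying to crawl a website, but I can't log in using the .NET HttpWebRequest/HttpWebResponse classes. Looking at the traffic in a network monitor, one key difference seems to be that the browser's login includes the payload in the POST message itself, whereas HttpWebRequest sends the payload in a separate message and gets a 301 response. Is there a way to make it use a single message? Or am I missing something else? I have used this same code for another website, where it works: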

// Set GET to logon site.
SiteRequest = (HttpWebRequest)WebRequest.Create(logonUrl);

SiteRequest.Method = "GET";
SiteRequest.AllowAutoRedirect = AllowRedirect;
SiteRequest.CookieContainer = SiteCookieContainer;
SiteRequest.Referer = logonUrl;

SiteResponse = (HttpWebResponse)SiteRequest.GetResponse();
mainStream = SiteResponse.GetResponseStream();
ReadAndIgnoreAllStreamBytes(mainStream);
mainStream.Close();

// Send POST to logon site.
SiteRequest = (HttpWebRequest)WebRequest.Create(postUrl);
SiteRequest.Method = "POST";
SiteRequest.AllowAutoRedirect = AllowRedirect;
SiteRequest.ContentType = "application/x-www-form-urlencoded";
SiteRequest.CookieContainer = SiteCookieContainer;
SiteRequest.CookieContainer.Add(SiteResponse.Cookies);
SiteRequest.Referer = postUrl;
SiteRequest.Timeout = TimeoutMsec;

buffer = Encoding.UTF8.GetBytes(logonPostData);
SiteRequest.ContentLength = buffer.Length;

postStream = SiteRequest.GetRequestStream();
postStream.Write(buffer, 0, buffer.Length);
postStream.Flush();
postStream.Close();

SiteResponse = (HttpWebResponse)SiteRequest.GetResponse();
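
One thing that may be worth trying for the "separate message" behaviour: by default HttpWebRequest can send the POST headers first and the body afterwards as part of the `Expect: 100-continue` handshake. The sketch below assumes that handshake is what splits the request (the traces do not confirm this) and turns it off on the request's `ServicePoint`. It also sets `UserAgent` and `Accept` to the values the browser trace shows. `LogonRequestSketch` and `CreatePost` are hypothetical names used only for illustration.

```csharp
using System;
using System.Net;

static class LogonRequestSketch
{
    // Hypothetical helper, not part of the original code: builds the logon POST
    // request with the 100-continue handshake disabled.
    public static HttpWebRequest CreatePost(string postUrl, CookieContainer cookies, int bodyLength)
    {
        var request = (HttpWebRequest)WebRequest.Create(postUrl);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.CookieContainer = cookies;
        request.ContentLength = bodyLength;

        // Assumption: the header/body split comes from "Expect: 100-continue".
        // Disabling it lets the body follow the headers without waiting for the server.
        request.ServicePoint.Expect100Continue = false;

        // Header values copied from the browser trace; some sites reject
        // requests that arrive without a User-Agent.
        request.UserAgent = "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0; Touch)";
        request.Accept = "text/html, application/xhtml+xml, */*";
        return request;
    }
}
```

The body would then be written to `GetRequestStream()` exactly as in the code above. Even with the handshake disabled, headers and body can still end up in separate TCP segments, which a server should treat the same as a single message.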

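I have the same problem using the HtmlWeb class in HtmlAgilityPack.

Thanks.

Update:

It turns out I was using the "www.example.com" form of the address rather than "example.com", which explains the redirect. But with the correct address I now get a 404 "page not found" error.

Here is what the browser sends for the POST: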

- Http: Request, POST /accounts/signin 
    Command: POST
  + URI: /accounts/signin
    ProtocolVersion: HTTP/1.1
    Accept:  text/html, application/xhtml+xml, */*
    Referer:  http://***.com/accounts/signin
    Accept-Language:  en-US,en;q=0.8,zh-Hans-CN;q=0.7,zh-Hans;q=0.5,zh-Hant-TW;q=0.3,zh-Hant;q=0.2
    UserAgent:  Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0; Touch)
  + ContentType:  application/x-www-form-urlencoded
    Accept-Encoding:  gzip, deflate
    Host:  example.com
    ContentLength:  67
    DNT:  1
    Connection:  Keep-Alive
    Cache-Control:  no-cache
  - Cookie:  PHPSESSID=169***efe; lang=en_US; cart=eyJ***wfQ%3D%3D; cartitems=W10%3D; __utma=***; __utmb=***; __utmc=**; __utmz=**
      PHPSESSID: 169***efe
      lang: en_US
      cart: eyJ***wfQ%3D%3D
      cartitems: W10%3D
      __utma: ***
      __utmb: ***
      __utmc: ***
      __utmz: ***

    HeaderEnd: CRLF
  - payload: HttpContentType =  application/x-www-form-urlencoded
     url: 
     email: ***
     password: ***

Here is what I am sending:

(The POST:)

- Http: Request, POST /accounts/signin 
    Command: POST
  + URI: /accounts/signin
    ProtocolVersion: HTTP/1.1
  + ContentType:  application/x-www-form-urlencoded
    Accept:  text/html, application/xhtml+xml, */*
    Accept-Language:  en-US,en;q=0.8,zh-Hans-CN;q=0.7,zh-Hans;q=0.5,zh-Hant-TW;q=0.3,zh-Hant;q=0.2
    Accept-Encoding:  gzip, deflate
    DNT:  1
    Cache-Control:  no-cache
    Referer:  http://***.com/accounts/signin
    Host:  chinesepod.com
  - Cookie:  lang=en_US; cart=eyJ***jowfQ%3D%3D; cartitems=W10%3D; PHPSESSID=944***3e7
      lang: en_US
      cart: eyJ***wfQ%3D%3D
      cartitems: W10%3D
      PHPSESSID: 944***3e7

    ContentLength:  61
    HeaderEnd: CRLF

(The separate payload:)

- Http: HTTP Payload, URL: /accounts/signin 
  - payload: HttpContentType =  application/x-www-form-urlencoded
     url: 
     email: ***
     password: ***

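The browser version has those __utXX cookies, which I assume are some tags the browser adds, right? Otherwise, assuming cookie order doesn't matter, the key difference is that the payload is sent separately. Does anything else look wrong?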
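
One more difference is visible in the two traces: the browser reports ContentLength 67 while the request from the code reports 61, so the two bodies differ by six bytes even though both carry the same three fields (url, email, password). A possible explanation, which the masked values make impossible to confirm, is that the browser percent-encodes characters such as `@` while `logonPostData` is sent unencoded. The sketch below shows one way to build the body with each value encoded. `FormBodySketch` and `Build` are hypothetical names, and the field names are taken from the payload in the trace.

```csharp
using System;
using System.Net;
using System.Text;

static class FormBodySketch
{
    // Hypothetical helper: builds an application/x-www-form-urlencoded body,
    // percent-encoding each value the way a browser form submission does.
    public static byte[] Build(string url, string email, string password)
    {
        string body = "url=" + WebUtility.UrlEncode(url)
                    + "&email=" + WebUtility.UrlEncode(email)
                    + "&password=" + WebUtility.UrlEncode(password);
        return Encoding.UTF8.GetBytes(body);
    }
}
```

The resulting byte array would take the place of `buffer` in the POST code, with `ContentLength` set from its length. Comparing that length against the browser's 67 is a quick check on whether encoding accounts for the gap.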

Thanks.

-John
