ruby-on-rails - How do you scrape information from a website that requires credentials (SSL)?

Question

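I was wondering whether anyone could point me in the right direction. I want to scrape HTML/text content from an SSL-enabled site (https in the URL). The site's file system will have multiple branches.

My question is:

How do I supply credentials to the external website from within my Rails application?

Thanks!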

score 2 · Accepted Answer

Use the Typhoeus gem.

I have struggled with this problem before, too.

If you use Typhoeus, however, fetching an HTTPS page is straightforward:

1.9.3p194 :001 > Typhoeus # Checking that Typhoeus gem is being used.
 => Typhoeus 
1.9.3p194 :002 > url = "https://twitter.com/"
 => "https://twitter.com/" 
1.9.3p194 :003 > response = Typhoeus::Request.get(url, :timeout => 5000)

 => #<Typhoeus::Response:0x007fdd8cc00488 @code=200, @curl_return_code=0, @curl_error_message="No error", @status_message=nil, @http_version=nil, @headers="HTTP/1.1 200 OK\r\nDate: Tue, 25 Sep 2012 23:56:32 GMT\r\nStatus: 200 OK\r\nX-Runtime: 0.08814\r\nX-MID: 0cfcab7a410834bf31115f9a5cd7fb62651aa568\r\nStrict-Transport-Security: max-age=631138519\r\nCache-Control: no-cache, no-store, must-revalidate, pre-check=0, post-check=0\r\nContent-Type: text/html; charset=utf-8\r\nX-Frame-Options: SAMEORIGIN\r\nLast-Modified: Tue, 25 Sep 2012 23:56:32 GMT\r\nETag: \"95db45f50f8dc1a45be3895e03a23d53\"\r\nExpires: Tue, 31 Mar 1981 05:00:00 GMT\r\nX-Transaction: 72253ef75f0755e1\r\nPragma: no-cache\r\nSet-Cookie: k=10.35.35.113.1348617392068257; path=/; expires=Tue, 02-Oct-12 23:56:32 GMT; domain=.twitter.com\r\nSet-Cookie: guest_id=v1%3A134861739271966362; domain=.twitter.com; path=/; expires=Fri, 26-Sep-2014 11:56:32 GMT\r\nSet-Cookie: _twitter_sess=BAh7CToPY3JlYXRlZF9hdGwrCFBS3P85AToMY3NyZl9pZCIlNTY2MzNjOTM0%250AOTIyMDE4ZmNkY2E4NjViZmE3ZTBkMDAiCmZsYXNoSUM6J0FjdGlvbkNvbnRy%250Ab2xsZXI6OkZsYXNoOjpGbGFzaEhhc2h7AAY6CkB1c2VkewA6B2lkIiViYjAw%250AY2Q1YWZkMDAwNmExNWJhNjAyYmNiNzBhOTA0Yg%253D%253D--5ffbea931432fe65a2128be90048e3bb6fc9dbca; domain=.twitter.com; path=/; HttpOnly\r\nX-XSS-Protection: 1; mode=block\r\nVary: Accept-Encoding\r\nContent-Encoding: gzip\r\nContent-Length: 13733\r\nServer: tfe\r\n\r\n", @body="<!DOCTYPE html>\n<html lang=\"en\">\n  <head>\n    <meta charset=\"utf-8\">\n    \n    <script>document.domain='twitter.com'</script>\n\n      <title>Twitter</title>\n\n    <meta http-equiv=\"X-UA-Compatible\" content=\"IE=edge,chrome=1\">\n    \n      <meta name=\"description\" content=\"Instantly connect to what&#39;s most important to you. 
Follow your friends, experts, favorite celebrities, and breaking news.\">\n    \n    \n      <link href=\"/favicons/favicon.ico\" rel=\"shortcut icon\" type=\"image/x-icon\">\n    \n    \n          <link rel=\"stylesheet\" href=\"https://twimg0-a.akamaihd.net/a/1348559220/t1/css/t1_core_logged_out.bundle.css\" type=\"text/css\" media=\"screen\">\n    \n        <link rel=\"stylesheet\" href=\"https://twimg0-a.akamaihd.net/a/13485592

1.9.3p194 :005 >    response.body # returns html document
 => "<!DOCTYPE html>\n<html lang=\"en\">\n  <head>\n    <meta charset=\"utf-8\">\n    \n    <script>document.domain='twitter.com'</script>\n\n      <title>Twitter</title>\n\n    <meta http-equiv=\"X-UA-Compatible\" content=\"IE=edge,chrome=1\">\n    \n      <meta name=\"description\" content=\"Instantly connect to what&#39;s most important to you. Follow your friends, experts, favorite celebrities, and breaking news.\">\n    \n    \n      <link href=\"/favicons/favicon.ico\" rel=\"shortcut icon\" type=\"image/x-icon\">\n    \n    \n          <link rel=\"stylesheet\" href=\"https://twimg0-a.akamaihd.net/a/1348559220/t1/css/t1_core_logged_out.bundle.css\" type=\"text/css\" media=\"screen\">\n    \n        <link rel=\"stylesheet\" href=\"https://twimg0-a.akamaihd.net/a/1348559220/t1/css/t1_more.bundle.css\" type=\"text/css\" media=\"screen\">\n    \n          <script>\n      (function() {\n        function getPhxPath(){var a=l.href.match(/#(.)(.*)$/);return a&&a[1]==\"!\"&&a[2]}function getEvent(a){return a?(a=a.replace(/^#|\\/$/,\"\").toLowerCase(),a.match(/^[a-z0-9_]+$/)?a:!1):!1}function redirectEventPath(a){var a=getEvent(a);if(a){var b=document.referrer||\"none\",c=\"ev_redir_\"+a+\"=\"+b+\"; path=/\";document.cookie=c,l.replace(\"/hashtag/\"+a)}}function resolveInlineRedirects(){var a=getPhxPath();a&&l.replace(a),l.hash!=\"\"&&redirectEventPath(l.hash.substr(1).toLowerCase())}var l=window.location;resolveInlineRedirects(),window.addEventListener?window.addEventListener(\"hashchange\",resolveInlineRedirects,!1):window.attachEvent&&window.attachEvent(\"onhashchange\",resolveInlineRedirects);\n      }());\n      </script>\n    \n    <script>\n      \n      \n      (func

Good luck!

score 1 · Accepted Answer

I can help you with this. It is actually not that hard.

URI.open("http://...", :http_basic_authentication => [user, password])
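
As a rough sketch of how the one-liner above could be wrapped for an HTTPS site, the following uses only open-uri options from the standard library. The helper names `basic_auth_options` and `fetch_protected_page`, the 8-second timeout, and the example URL in the comment are assumptions made for illustration, not part of the original answer.

```ruby
require "open-uri"
require "openssl"

# Hypothetical helper: collects the open-uri options for an authenticated fetch.
def basic_auth_options(user, password, timeout_seconds = 8)
  {
    http_basic_authentication: [user, password],  # sent as an HTTP Basic Authorization header
    ssl_verify_mode: OpenSSL::SSL::VERIFY_PEER,   # keep certificate verification switched on
    open_timeout: timeout_seconds,
    read_timeout: timeout_seconds
  }
end

# Hypothetical helper: returns the response body as a String.
def fetch_protected_page(url, user, password)
  URI.parse(url).open(basic_auth_options(user, password)) { |io| io.read }
end

# Example call (placeholder URL and credentials):
#   html = fetch_protected_page("https://example.com/private", "user", "password")
```

Note that this only covers sites protected by HTTP Basic authentication; a site with a login form needs a session cookie instead.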

If you want to parse the pages, you could even adapt my crawler. Here is its main method.

require "open-uri"
require "timeout"
require "zlib"

SHINSO_HEADERS = {
  'Accept'          => '*/*',
  'Accept-Charset'  => 'utf-8, windows-1251;q=0.7, *;q=0.6',
  'Accept-Encoding' => 'gzip,deflate',
  'Accept-Language' => 'bg-BG, bg;q=0.8, en;q=0.7, *;q=0.6',
  'Connection'      => 'keep-alive',
  'Cookie'          => '',
  'From'            => 'email@example.com',
  'Referer'         => 'http://svejo.net/',
  'User-Agent'      => 'Your user agent'
}

# Fetches url_address and returns the final URI after any redirects.
# Assumes the enclosing class defines attr_accessor :errors.
def crawl(url_address)
  self.errors = []
  begin
    begin
      url_address = URI.parse(url_address)
    rescue URI::InvalidURIError
      # Unescape, then re-escape, so characters such as spaces become valid.
      # URI.decode/URI.encode were removed in Ruby 3.0; the parser object offers the same calls.
      parser = URI::RFC2396_Parser.new
      url_address = URI.parse(parser.escape(parser.unescape(url_address)))
    end
    url_address.normalize!
    stream = Timeout.timeout(8) { url_address.open(SHINSO_HEADERS) }
    if stream.size > 0
      url_crawled = URI.parse(stream.base_uri.to_s)
    else
      self.errors << "Server said status 200 OK but document file is zero bytes."
      return
    end
  rescue StandardError => exception
    # Network, SSL, timeout, and HTTP errors all end up here.
    self.errors << exception
    return
  end
end

url_crawled is what you ultimately need.
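
The URL-repair step inside crawl can be tried on its own. This is a minimal sketch, assuming a modern Ruby where `URI::RFC2396_Parser` is available; the method name `parse_lenient` and the sample URL are made up for illustration.

```ruby
require "uri"

# Hypothetical helper: parse a URL, repairing characters that URI.parse rejects.
def parse_lenient(url)
  URI.parse(url)
rescue URI::InvalidURIError
  parser = URI::RFC2396_Parser.new
  # Unescape first so sequences that are already escaped are not escaped twice.
  URI.parse(parser.escape(parser.unescape(url)))
end

uri = parse_lenient("http://Example.com/some page")
uri.normalize!  # lowercases the scheme and host, among other things
puts uri
```

A URL that is already valid takes the first branch and is returned unchanged, so the fallback only costs anything for malformed input.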

Try testing with this address: https://developer.mozilla.org/en-US/docs/HTTP_access_control

If you still get errors, your server is probably misconfigured certificate-wise, and you should check that.
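
One way to check whether a failure really comes from the certificate is to open a verified TLS connection and look for an `OpenSSL::SSL::SSLError`. The sketch below uses Net::HTTP from the standard library; the helper names `build_verified_client` and `certificate_problem` and the 8-second timeouts are assumptions for illustration.

```ruby
require "net/http"
require "openssl"
require "uri"

# Hypothetical helper: a Net::HTTP client that insists on a valid certificate.
# Nothing is sent over the network until #start is called.
def build_verified_client(url)
  uri = URI.parse(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == "https")
  http.verify_mode = OpenSSL::SSL::VERIFY_PEER
  http.open_timeout = 8
  http.read_timeout = 8
  http
end

# Hypothetical helper: returns nil when the TLS handshake succeeds,
# or the OpenSSL error message when certificate verification fails.
def certificate_problem(url)
  build_verified_client(url).start { nil }
rescue OpenSSL::SSL::SSLError => e
  e.message
end

# Example call (placeholder URL):
#   puts certificate_problem("https://example.com/") || "certificate looks fine"
```

Other failures, such as DNS errors or timeouts, are deliberately not rescued here, so they stay distinguishable from certificate problems.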

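On a related note, if you are serious about parsing, you may also want to use the CharGuess gem and Zlib to read the content correctly, and then Iconv to convert problematic content. Here is an example.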

if    stream.content_encoding.include?('gzip')
  document = Zlib::GzipReader.new(stream).read
elsif stream.content_encoding.include?('deflate')
  # Inflate (not Deflate) is the call that decompresses the body.
  document = Zlib::Inflate.inflate(stream.read)
#elsif stream.content_encoding.include?('x-gzip') or
#elsif stream.content_encoding.include?('compress')
else
  document = stream.read
end
# Guess the character set of the decompressed document.
self.charset_guess = CharGuess.guess(document)

Then just use Iconv on the content.

I hope this helps.

Regards, Yavor
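
Iconv was removed from Ruby's standard library in Ruby 2.0, so on current Rubies the conversion step is usually done with `String#encode` instead. The sketch below is one possible way to combine decompression and conversion; the method names `decode_body` and `to_utf8` are hypothetical, and it assumes the source charset is already known (for example from a guess like the one above).

```ruby
require "zlib"
require "stringio"

# Hypothetical helper: undo the Content-Encoding of a raw response body.
# content_encodings is an array such as ["gzip"], as open-uri reports it.
def decode_body(raw, content_encodings)
  if content_encodings.include?("gzip")
    Zlib::GzipReader.new(StringIO.new(raw)).read
  elsif content_encodings.include?("deflate")
    Zlib::Inflate.inflate(raw)
  else
    raw
  end
end

# Hypothetical helper: convert text from source_charset to UTF-8,
# replacing bytes that cannot be converted instead of raising.
def to_utf8(text, source_charset)
  text.encode("UTF-8", source_charset, invalid: :replace, undef: :replace)
end
```

For a page served as gzip-compressed windows-1251, the call would look like `to_utf8(decode_body(raw, ["gzip"]), "Windows-1251")`.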

score 0 · Accepted Answer

require 'httpclient'
require 'nokogiri'

url = "http://example.com"

client = HTTPClient.new

# Credentials are only sent to URLs under the domain passed to set_auth.
client.set_auth(url, "username", "password")

doc = Nokogiri::HTML(client.get_content(url))

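Hey guys, sorry for the late reply; I have been swamped with a few things. The code above worked for me (after a lot of tangoing with Mechanize and a few other Nokogiri-based gems). Some other gems, such as open-uri and Mechanize, caused errors such as "MD5 Unknown hashing algorithm". Thanks for your time and help.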
