html - Downloading HTML text with Ruby

Question

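I'm trying to create a histogram of the letters (a, b, c, etc.) on a specified web page. I plan to build the histogram itself with a hash. However, I'm having some trouble actually getting the HTML.

My current code: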

#!/usr/local/bin/ruby


require 'net/http'
require 'open-uri'


# This will be the hash used to store the
# histogram.
histogram = Hash.new(0)

def open(url)
    Net::HTTP.get(URI.parse(url))
end

page_content = open('_insert_webpage_here')

# String#each was removed in Ruby 1.9; each_line iterates line by line.
page_content.each_line do |line|
    puts line
end

This does a fine job of fetching the HTML. However, it gets everything. For www.stackoverflow.com, it gives me:

<body><h1>Object Moved</h1>This document may be found <a HREF="http://stackoverflow.com/">here</a></body>

Pretending that this is the right page, I don't want the HTML tags. I just want to get "Object Moved" and "This document may be found here".

Is there a reasonably simple way to do this?

score 2 · Accepted Answer

When you require 'open-uri', you don't need to redefine open with Net::HTTP.

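require 'open-uri'

# Since Ruby 3.0, Kernel#open no longer opens URLs; use URI.open instead.
page_content = URI.open('http://www.stackoverflow.com').read

# Hash.new(0) makes missing keys default to 0, so no ||= is needed.
histogram = Hash.new(0)
page_content.each_char do |c|
  histogram[c] += 1
end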

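Note: this does not strip the <tags> in the HTML document, so <html><body>x!</body></html> will give { '<' => 4, 'h' => 2, 't' => 2, ... } instead of { 'x' => 1, '!' => 1 }. To remove the tags, you can use something like Nokogiri (which you said is not available) or some kind of regular expression (such as the one in Dru's answer).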
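
Since the question asks for a histogram of letters specifically, one option is to count only alphabetic characters. The following is a minimal sketch, assuming case-insensitive counting of ASCII letters only; `letter_histogram` is a hypothetical helper name, not something from the original answer.

```ruby
# Hypothetical helper: counts ASCII letters only, ignoring case.
def letter_histogram(text)
  histogram = Hash.new(0) # missing letters default to 0
  # scan yields every single-letter match to the block.
  text.downcase.scan(/[a-z]/) do |letter|
    histogram[letter] += 1
  end
  histogram
end

counts = letter_histogram('Object Moved')
# Print the letters in alphabetical order with their counts.
counts.sort.each { |letter, n| puts "#{letter}: #{n}" }
```

Digits, punctuation, and whitespace are skipped entirely, so stray characters such as '<' or '!' never end up as keys.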

score 1 · Accepted Answer

See the "Following Redirection" section of the Net::HTTP documentation.

answered 2012-05-02T21:49:55.437
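
The "Object Moved" page in the question is a redirect response, which Net::HTTP.get does not follow on its own. Below is a rough sketch of following redirects by hand, along the lines of that documentation section. The `fetch` name and the limit of 10 are assumptions for illustration, and the sketch assumes the Location header holds an absolute URL.

```ruby
require 'net/http'
require 'uri'

# Follows up to `limit` redirects and returns the final response.
def fetch(uri_str, limit = 10)
  raise ArgumentError, 'too many HTTP redirects' if limit == 0

  response = Net::HTTP.get_response(URI(uri_str))
  case response
  when Net::HTTPSuccess
    response
  when Net::HTTPRedirection
    # The Location header holds the redirect target.
    fetch(response['location'], limit - 1)
  else
    response.value # raises an error for non-2xx responses
  end
end

# Usage (performs a network request):
#   puts fetch('http://www.stackoverflow.com').body
```

The limit guards against redirect loops; each hop reduces it by one until the ArgumentError is raised.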

score 1 · Accepted Answer

Stripping HTML tags without Nokogiri:

puts page_content.gsub(/<\/?[^>]*>/, "")

http://codesnippets.joyent.com/posts/show/615
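
As a quick check, here is that regular expression applied to the response shown in the question. This is only a sketch: the pattern removes anything shaped like a tag, so it is a rough approach that does not account for comments, script contents, or a '>' inside an attribute value.

```ruby
html = '<body><h1>Object Moved</h1>This document may be found ' \
       '<a HREF="http://stackoverflow.com/">here</a></body>'

# Remove anything that looks like an opening or closing tag.
text = html.gsub(/<\/?[^>]*>/, '')
puts text
# prints "Object MovedThis document may be found here"
```

Because the tags are deleted rather than replaced with a space, text from adjacent elements runs together, which does not matter for a letter histogram.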
