ruby - Ruby HTML scraper written with Hpricot has problems with escaped HTML

Question

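I am trying to scrape this page: http://www.udel.edu/dining/menus/russell.html. I have written a scraper in Ruby using the Hpricot library.

The problem: the HTML page is escaped, and I need to display it unescaped.

example: "M&amp;M" should be "M&M"  
example: "Entr&eacute;e" should be "Entrée"

I have tried using the CGI library in Ruby (not very successfully) as well as the HTMLEntities gem, which I found through a Stack Overflow post.

HTMLEntities works during testing: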

require 'rubygems' 
require 'htmlentities'
require 'cgi'

h = HTMLEntities.new
puts "h.decode('Entr&eacute;e') = #{h.decode("Entr&eacute;e")}"

blank = "&nbsp;"
puts "h.decode blank = #{h.decode blank}"
puts "CGI.unescapeHTML blank = |#{CGI.unescapeHTML blank}|"

puts "h.decode '<th width=86 height=59 scope=row>Vegetarian Entr&eacute;e</th> ' = |#{h.decode '<th width=86 height=59 scope=row>Vegetarian Entr&eacute;e</th> '}|"

correctly yields:

h.decode('Entr&eacute;e') = Entrée
h.decode blank =  
CGI.unescapeHTML blank = |&nbsp;|
h.decode '<th width=86 height=59 scope=row>Vegetarian Entr&eacute;e</th> ' = |<th width=86 height=59 scope=row>Vegetarian Entrée</th> |

However, when I use it on a file opened with open-uri, it does not work properly:

require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'htmlentities'
require 'cgi'
f = open("http://www.udel.edu/dining/menus/russell.html")
htmlentity = HTMLEntities.new
while line = f.gets
  puts htmlentity.decode line
end

incorrectly yields things like:

<th width="60" height="59" scope="row">Vegetarian EntrÃ©e</th>

and

<th scope="row">Â </th>  // note: was originally '&nbsp;' to indicate a blank

but correctly handles M&M by yielding:

<td valign="middle" class="menulineA">M&M Brownies</td>

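Am I handling the escaped HTML incorrectly? I don't understand why it works in some cases and not in others.

I am running ruby 1.8.7 (2009-06-12 patchlevel 174) [i486-linux]

Any help or suggestions are appreciated. Thanks.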

score 0 · Accepted Answer

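HTMLEntities seems to be working, but you have an encoding problem. The terminal you are printing to is probably set up for a Latin character set and barfs on the UTF-8 characters your script outputs.

What environment are you running Ruby in?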
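To make the encoding explanation concrete, here is a minimal sketch that reproduces the garbled output from the question by reinterpreting UTF-8 bytes as ISO-8859-1, which is roughly what a Latin-1 terminal does. It assumes Ruby 1.9 or later (the String encoding API does not exist in 1.8.7), and the helper name `as_seen_on_latin1_terminal` is made up for illustration.

```ruby
# Assumes Ruby 1.9+; String#force_encoding and #encode are not in 1.8.7.

# Hypothetical helper: reinterpret the UTF-8 bytes of +text+ as
# ISO-8859-1, the way a Latin-1 terminal would read them, and return
# the result as a UTF-8 string so it can be inspected.
def as_seen_on_latin1_terminal(text)
  text.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

entree = "Entr\u00E9e"  # the decoded form of "Entr&eacute;e"
nbsp   = "\u00A0"       # the decoded form of "&nbsp;"
plain  = "M&M"          # pure ASCII

puts as_seen_on_latin1_terminal(entree)        # two characters where "é" was
puts as_seen_on_latin1_terminal(nbsp).inspect  # "Â" followed by a no-break space
puts as_seen_on_latin1_terminal(plain)         # unchanged
```

The "é" and the no-break space each take two bytes in UTF-8, so each shows up as two Latin-1 characters, while the ASCII-only "M&M" is byte-for-byte identical in both encodings.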

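The reason '&' is displayed correctly is that it is an ASCII character, so it is rendered the same way in most encodings. The problem is that it should not appear on its own in an XML document, and it may cause trouble later when you feed the decoded file to Hpricot. I believe the correct approach is to parse with Hpricot first and then pass what you extract from the document to HTMLEntities.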
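The order of operations suggested here (extract from the markup first, decode entities afterwards) can be sketched as follows. To stay self-contained, the sketch does not load Hpricot or HTMLEntities: `cell_contents` is a hypothetical regex stand-in for an Hpricot query, and `decode_entities` is a hypothetical stand-in for `HTMLEntities#decode` that knows only a handful of entities. In a real scraper the two gems would take their place.

```ruby
# Hypothetical stand-in for HTMLEntities#decode: covers only the few
# named entities used in this example and leaves unknown ones untouched.
ENTITIES = {
  "amp" => "&", "lt" => "<", "gt" => ">",
  "eacute" => "\u00E9", "nbsp" => "\u00A0"
}

def decode_entities(text)
  text.gsub(/&(\w+);/) { ENTITIES.fetch($1, $&) }
end

# Hypothetical stand-in for an Hpricot query: returns the raw inner
# HTML of every <td> or <th> cell, with entities still escaped.
def cell_contents(html)
  html.scan(%r{<t[dh]\b[^>]*>(.*?)</t[dh]>}m).flatten
end

html = '<tr><th scope="row">Vegetarian Entr&eacute;e</th>' \
       '<td class="menulineA">M&amp;M Brownies</td></tr>'

# Step 1: extract while the markup is still well-formed.
# Step 2: decode only the extracted text.
cells = cell_contents(html).map { |raw| decode_entities(raw) }
puts cells
```

Decoding last means the parser never sees a bare `&` or `<` that came from an entity, so the extracted text can contain those characters without affecting how the markup is read.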
