ruby - Disabling HTML escaping in XML with Nokogiri

Question

I'm trying to parse an XML document from the Google Directions API.

This is what I have so far:

x = Nokogiri::XML(GoogleDirections.new("48170", "48104").xml)
x.xpath("//DirectionsResponse//route//leg//step").each do |q|
  q.xpath("html_instructions").each do |h|
    puts h.inner_html
  end
end

The output looks like this:

Head &lt;b&gt;south&lt;/b&gt; on &lt;b&gt;Hidden Pond Dr&lt;/b&gt; toward &lt;b&gt;Ironwood Ct&lt;/b&gt;
Turn &lt;b&gt;right&lt;/b&gt; onto &lt;b&gt;N Territorial Rd&lt;/b&gt;
Turn &lt;b&gt;left&lt;/b&gt; onto &lt;b&gt;Gotfredson Rd&lt;/b&gt;
...

I'd like the output to be:

Turn <b>right</b> onto <b>N Territorial Rd</b>

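The problem seems to be that Nokogiri is escaping the HTML inside the XML.

I trust Google, but I think it would be good to sanitize it further to:

Turn right onto N Territorial Rd

But I can't do that (perhaps with sanitize) without the raw XML. Ideas?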

score 5 · Accepted Answer

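I don't have the Google Directions API installed, so I can't get at the XML, but I strongly suspect the problem is a result of telling Nokogiri that you're dealing with XML. As a result, it returns the HTML to you encoded, just as it should be inside XML.

You can unescape the HTML using: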

require 'cgi'

CGI::unescape_html('Head &lt;b&gt;south&lt;/b&gt; on &lt;b&gt;Hidden Pond Dr&lt;/b&gt; toward &lt;b&gt;Ironwood Ct&lt;/b&gt;')
=> "Head <b>south</b> on <b>Hidden Pond Dr</b> toward <b>Ironwood Ct</b>"

unescape_html is an alias for unescapeHTML:

Unescape a string that has been HTML-escaped
  CGI::unescapeHTML("Usage: foo &quot;bar&quot; &lt;baz&gt;")
     # => "Usage: foo \"bar\" <baz>"
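
To see why the entities are there in the first place, here is a small round trip using only the standard library. The sample string is taken from the question; the variable names are just for illustration. It sketches how HTML gets escaped so it can sit inside an XML text node, and how unescapeHTML reverses that.

```ruby
require 'cgi'

# HTML as a person would write it.
html = 'Turn <b>right</b> onto <b>N Territorial Rd</b>'

# Escaping turns the markup characters into entities, which is the
# form the text takes when it is embedded in an XML document.
escaped = CGI.escapeHTML(html)

# Unescaping restores the original HTML string.
restored = CGI.unescapeHTML(escaped)

puts escaped
puts restored
```

The first line printed is the entity-encoded form seen in the question, and the second is the original HTML again.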

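I had to think about this a bit more. It's something I've run into before, but it's one of those things that slipped my mind in the rush of work. The fix is simple: you're using the wrong method to retrieve the content. Instead of: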

puts h.inner_html

Use:

puts h.text

As proof:

require 'httpclient'
require 'nokogiri'

# This URL comes from: https://developers.google.com/maps/documentation/directions/#XML
url = 'http://maps.googleapis.com/maps/api/directions/xml?origin=Chicago,IL&destination=Los+Angeles,CA&waypoints=Joplin,MO|Oklahoma+City,OK&sensor=false'
clnt = HTTPClient.new

doc = Nokogiri::XML(clnt.get_content(url))
doc.search('html_instructions').each do |html|
  puts html.text
end

Which outputs:

Head <b>south</b> on <b>S Federal St</b> toward <b>W Van Buren St</b>
Turn <b>right</b> onto <b>W Congress Pkwy</b>
Continue onto <b>I-290 W</b>
[...]

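The difference is that inner_html returns the node's content as serialized markup, with the entities left encoded, while text decodes them for you. Inside Nokogiri::XML::Node, text, to_str and inner_text are all aliases for content, for our parsing pleasure.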

score 1 · Answer

Wrap the node in CDATA:

def wrap_in_cdata(node)
    # Using Nokogiri::XML::Node#content instead of #inner_html (which
    # escapes HTML entities) so nested nodes will not work
    node.inner_html = node.document.create_cdata(node.content)
    node
end

Nokogiri::XML::Node#inner_html escapes HTML entities, except inside CDATA sections.

fragment = Nokogiri::HTML.fragment "<div>Here is an unescaped string: <span>Turn left > right > straight & reach your destination.</span></div>"
puts fragment.inner_html
# <div>Here is an unescaped string: <span>Turn left &gt; right &gt; straight &amp; reach your destination.</span></div>


fragment.xpath(".//span").each {|node| node.inner_html = node.document.create_cdata(node.content) }
fragment.inner_html
# <div>Here is an unescaped string: <span>Turn left > right > straight & reach your destination.</span>\n</div>

score -1 · Answer

This isn't a great or DRY solution, but it works:

puts h.inner_html.gsub("&lt;b&gt;" , "").gsub("&lt;/b&gt;", "").gsub("&lt;div style=\"font-size:0.9em\"&gt;", "").gsub("&lt;/div&gt;", "")
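
If you do go this route, the chain of gsub calls could be collapsed into a single pattern. This is only a sketch: ESCAPED_TAG and strip_escaped_tags are hypothetical names, the "Toll road" text is made-up sample data, and the regex assumes every &lt;...&gt; pair in the string is a tag, so it would also eat a literal escaped < that appears in the text itself.

```ruby
# Matches one entity-escaped tag, e.g. "&lt;b&gt;" or "&lt;/div&gt;".
# The non-greedy .*? stops at the first "&gt;" after each "&lt;".
ESCAPED_TAG = /&lt;.*?&gt;/

# Hypothetical helper: removes every escaped tag from the string.
def strip_escaped_tags(escaped)
  escaped.gsub(ESCAPED_TAG, '')
end

# Sample input in the same escaped form that inner_html returned in the question.
line = 'Turn &lt;b&gt;left&lt;/b&gt; onto &lt;b&gt;Gotfredson Rd&lt;/b&gt;' \
       '&lt;div style="font-size:0.9em"&gt;Toll road&lt;/div&gt;'

puts strip_escaped_tags(line)
```

Like the original chain, this leaves no space where the div was removed, and any other entities (such as &amp;) stay encoded.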
