ruby - Parsing an HTML string into an array

Question

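I'm building a wiki-style diff feature for HTML bodies generated by TinyMCE. diff-lcs is a diff gem that accepts arrays or objects. Most diff tasks work on code and simply compare lines; diffing an HTML text body is more complicated. If I just pass in the text bodies, I get a character-by-character comparison. The output is correct, but it looks like garbage.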

seq1 = "<p>Here is a paragraph. A sentence with <strong>bold text</strong>.</p><p>The second paragraph.</p>"

seq2 = seq1.gsub(/[.!?]/, '\0|').split('|')
=> ["<p>Here is a paragraph.", " A sentence with <strong>bold text</strong>.", "</p><p>The second paragraph.", "</p>"]

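If someone changes the second paragraph, the diff output drags in the closing tag of the preceding paragraph. I can't just use strip_tags, because I want to keep the formatting in the comparison view. The ideal comparison would be based on complete sentences, with the HTML separated out.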

seq2.NokogiriMagic
=> ["<p>", "Here is a paragraph.", " A sentence with ", "<strong>", "bold text", "</strong>", ".", "</p>", "<p>", "The second paragraph.", "</p>"]

I've found plenty of neat Nokogiri methods, but nothing that does the above.

score 3 · Accepted Answer

Here's how you could do it using a SAX parser:

require 'nokogiri'

html = "<p>Here is a paragraph. A sentence with <strong>bold text</strong>.</p><p>The second paragraph.</p>"

class ArraySplitParser < Nokogiri::XML::SAX::Document
  attr_reader :array
  def initialize; @array = []; end
  def start_element(name, attrs=[])
    tag = "<" + name
    attrs.each { |k,v| tag += " #{k}=\"#{v}\"" }
    @array << tag + ">"
  end
  def end_element(name); @array << "</#{name}>"; end
  def characters(str); @array += str.gsub(/\s/, '\0|').split('|'); end
end

parser = ArraySplitParser.new
Nokogiri::XML::SAX::Parser.new(parser).parse(html)
puts parser.array.inspect
# ["<p>", "Here ", "is ", "a ", "paragraph. ", "A ", "sentence ", "with ", "<strong>", "bold ", "text", "</strong>", ".", "</p>"]

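Note that you have to wrap your HTML in a root element so that the XML parser doesn't miss the second paragraph in your example. Something like this should work: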

# ...
Nokogiri::XML::SAX::Parser.new(parser).parse('<x>' + html + '</x>')
# ...
puts parser.array[1..-2].inspect  # [1..-2] drops the wrapper's "<x>" and "</x>" tokens
# ["<p>", "Here ", "is ", "a ", "paragraph. ", "A ", "sentence ", "with ", "<strong>", "bold ", "text", "</strong>", ".", "</p>", "<p>", "The ", "second ", "paragraph.", "</p>"]

[Edit] Updated to demonstrate how to retain element attributes in the `start_element` method.
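
The `characters` callback above splits text on whitespace, whereas the question asks for sentence-sized tokens. Below is a minimal sketch of sentence-level splitting in plain Ruby. The names `split_sentences` and `tokenize` are hypothetical, and the regex scan in `tokenize` is only a stand-in for the SAX callbacks, not a real HTML parser, so it assumes simple, well-formed markup.

```ruby
# Hypothetical helper: split text into sentence-sized chunks, keeping the
# terminating punctuation attached to the sentence it ends.
def split_sentences(text)
  text.scan(/[^.!?]*[.!?]+|[^.!?]+/)
end

# Stand-in for the SAX callbacks: a regex scan that separates tags from text.
# This is an assumption for illustration and does not handle arbitrary HTML.
def tokenize(html)
  html.scan(/<[^>]+>|[^<]+/).flat_map do |chunk|
    chunk.start_with?('<') ? [chunk] : split_sentences(chunk)
  end
end

html = "<p>Here is a paragraph. A sentence with <strong>bold text</strong>.</p><p>The second paragraph.</p>"
p tokenize(html)
# ["<p>", "Here is a paragraph.", " A sentence with ", "<strong>", "bold text", "</strong>", ".", "</p>", "<p>", "The second paragraph.", "</p>"]
```

In the SAX version, the body of `split_sentences` could probably replace the `gsub`/`split` pair inside `characters`. Note that this simple pattern also splits on periods inside abbreviations and decimal numbers.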

score 2 · Accepted Answer

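You're not writing idiomatic Ruby. We don't use mixed upper/lower case in variable names, and, as in programming in general, mnemonic variable names are a good idea for clarity. Refactor your code so it looks more like how I'd write it: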

tags = %w[p ol ul li h6 h5 h4 h3 h2 h1 em strong i b table thead tbody th tr td]
# Deconstruct HTML body 1
doc = Nokogiri::HTML.fragment(@versionOne.body)
nodes = doc.css(tags.join(', '))

# Reconstruct HTML body 1 into comparable array
output = []
nodes.each do |node|

  output << [
    "<#{ node.name }",
    # attribute_nodes yields Attr objects; node.attributes is a Hash and would yield [name, attr] pairs
    node.attribute_nodes.map { |attr| ' %s="%s"' % [attr.name, attr.value] }.join,
    '>'
  ].join

  output << node.children.to_s.gsub(/[\s.!?]/, '|\0|').split('|').flatten

  output << "</#{ node.name }>"

end

# Same deal for nokoOutput2

sdiff = Diff::LCS.sdiff(nokoOutput2.flatten, output.flatten)

This line:

tag | " #{ param.name }=\"#{ param.value }\" "

in your code isn't Ruby at all, because String has no | operator. Did you add a | operator to String somewhere in your code without showing the definition?
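
If the intent of that line was to append each attribute to the tag string, `<<` (or `+=`) would do it. The following is a small sketch under that assumption. `Attr` is a hypothetical stand-in for an attribute object that responds to `name` and `value`, and is not part of Nokogiri.

```ruby
# Hypothetical stand-in for an attribute object with a name and a value.
Attr = Struct.new(:name, :value)

attrs = [Attr.new('class', 'intro'), Attr.new('id', 'first')]

# Core String defines no | method, so `tag | "..."` raises NoMethodError.
puts String.method_defined?(:|)   # false

tag = '<p'.dup                    # dup gives a mutable copy of the literal
attrs.each { |attr| tag << %( #{attr.name}="#{attr.value}") }
tag << '>'
puts tag                          # <p class="intro" id="first">
```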

One problem I see is with:

output << node.children.to_s.gsub(/[\s.!?]/, '|\0|').split('|').flatten

Many of the tags you're looking for can contain other tags from the list:

<html>
  <body>
    <table><tr><td>
      <table><tr><td>
        foo
      </td></tr></table>
    </td></tr></table>
  </body>
</html>

Creating a recursive method to handle:

node.attributes.map { |param| '%s="%s"' % [param.name, param.value] }.join(' '),

would probably improve your output. This is untested, but the general idea is:

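def dump_node(node)

  # Text nodes have no tag of their own, so split them into tokens instead
  return node.to_s.gsub(/[\s.!?]/, '|\0|').split('|') if node.text?

  # Keep output an Array; a joined String cannot take the child arrays below
  output = [[
    "<#{ node.name }",
    node.attribute_nodes.map { |attr| ' %s="%s"' % [attr.name, attr.value] }.join,
    '>'
  ].join]

  output += node.children.flat_map { |n| dump_node(n) }

  output << "</#{ node.name }>"

end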
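
To try the recursive idea without installing anything, here is a sketch that swaps Nokogiri for REXML, which ships with standard Ruby installations. It assumes well-formed, XML-like markup. `dump_element` is a hypothetical name, and whitespace-only text nodes are skipped, which is a choice made for this illustration.

```ruby
require 'rexml/document'

# Hypothetical helper: flatten one element and everything below it into tokens.
def dump_element(element)
  attrs = +''
  element.attributes.each_attribute do |attr|
    attrs << %( #{attr.name}="#{attr.value}")
  end

  tokens = ["<#{element.name}#{attrs}>"]
  element.children.each do |child|
    if child.is_a?(REXML::Element)
      tokens.concat(dump_element(child))   # recurse into nested tags
    elsif child.is_a?(REXML::Text)
      tokens << child.to_s unless child.to_s.strip.empty?
    end
  end
  tokens << "</#{element.name}>"
end

xml = '<table><tr><td><table class="inner"><tr><td>foo</td></tr></table></td></tr></table>'
tokens = dump_element(REXML::Document.new(xml).root)
p tokens
```

Because every element is visited exactly once from its parent, the inner table appears a single time in `tokens`. Selecting all listed tags with one CSS query would instead also return nested elements as separate matches, so their content would be emitted more than once.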
