ruby - Searching all elements before an h2 element in hpricot/nokogiri

Question

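I am trying to parse a Wiktionary entry to retrieve all English definitions. I am able to retrieve all definitions; the problem is that some of them are in other languages. What I would like to do is somehow retrieve only the HTML block that contains the English definitions. I have found that, when there are entries for other languages, the heading that follows the English definitions can be retrieved with: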

header = (doc/"h2")[3]

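So I would like to search only the elements that come before this header element. I thought this might be possible with header.preceding_siblings(), but that does not seem to work. Any suggestions?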

score 2 · Accepted Answer

You can use the visitor pattern with Nokogiri. This code removes everything starting from the h2 of the other-language definition:

require 'nokogiri'
require 'open-uri'

class Visitor
  def initialize(node)
    @node = node
  end

  def visit(node)
    if @remove || @node == node
      node.remove
      @remove = true
      return
    end
    node.children.each do |child|
      child.accept(self)
    end
  end
end

doc = Nokogiri::HTML(URI.open('http://en.wiktionary.org/wiki/pony'))
node = doc.search("h2")[2]  # In this case, the Italian h2 is at index 2. Your page may differ.

doc.root.accept(Visitor.new(node))  # Removes all page contents starting from node

score 1 · Accepted Answer

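The following code uses Hpricot.
It gets the text from the English-language heading (h2) up to the next heading (h2), or up to the footer if there are no other languages: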

require 'hpricot'
require 'open-uri'

def get_english_definition(url)
  doc = Hpricot(open(url))

  span = doc.at('h2/span[@class="mw-headline"][text()=English]')
  english_header = span && span.parent
  return nil unless english_header

  next_header_or_footer =
    Hpricot::Elements[*english_header.following_siblings].at('h2') ||
    doc.at('[@class="printfooter"]')

  Hpricot::Elements.expand(english_header.next_node,
                           next_header_or_footer.previous_node).to_s
end

Example:

get_english_definition "http://en.wiktionary.org/wiki/gift"

score 1 · Accepted Answer

With Nokogiri:

doc = Nokogiri::HTML(code)
stop_node = doc.css('h2')[3]
doc.traverse do |node|
  break if node == stop_node
  # else, do whatever, e.g. `puts node.name`
end

This iterates over all nodes that come before whatever node you assign to stop_node on line 2.
