ruby-on-rails - Nokogiri: Select content between element A and B

Question

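What is the smartest way to have Nokogiri select all content between a start and a stop element (including the start and stop elements themselves)?

See the sample code below to understand what I'm looking for: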

require 'rubygems'
require 'nokogiri'

value = Nokogiri::HTML.parse(<<-HTML_END)
  <html>
    <body>
      <p id='para-1'>A</p>
      <div class='block' id='X1'>
        <p class="this">Foo</p>
        <p id='para-2'>B</p>
      </div>
      <p id='para-3'>C</p>
      <p class="that">Bar</p>
      <p id='para-4'>D</p>
      <p id='para-5'>E</p>
      <div class='block' id='X2'>
        <p id='para-6'>F</p>
      </div>
      <p id='para-7'>F</p>
      <p id='para-8'>G</p>
    </body>
  </html>
HTML_END

parent = value.css('body').first

# START element
@start_element = parent.at('p#para-3')
# STOP element
@end_element = parent.at('p#para-7')

The result (return value) should look like this:

<p id='para-3'>C</p>
<p class="that">Bar</p>
<p id='para-4'>D</p>
<p id='para-5'>E</p>
<div class='block' id='X2'>
  <p id='para-6'>F</p>
</div>
<p id='para-7'>F</p>

Update: This is my current solution, but I think there must be something smarter:

@my_content = ""
@selected_node = true

def collect_content(_start)

  if _start == @end_element
    @my_content << _start.to_html
    @selected_node = false
  end

  if @selected_node == true
    @my_content << _start.to_html
    collect_content(_start.next)
  end

end

collect_content(@start_element)

puts @my_content

score 10 · Accepted Answer

A clever one-liner that uses recursion:

def collect_between(first, last)
  first == last ? [first] : [first, *collect_between(first.next, last)]
end

An iterative solution:

def collect_between(first, last)
  result = [first]
  until first == last
    first = first.next
    result << first
  end
  result
end

Edit: a (short) explanation of the asterisk

It is called the splat operator. It "unrolls" an array:

array = [3, 2, 1]
[4, array]  # => [4, [3, 2, 1]]
[4, *array] # => [4, 3, 2, 1]

some_method(array)  # => some_method([3, 2, 1])
some_method(*array) # => some_method(3, 2, 1)

def other_method(*array); array; end
other_method(1, 2, 3) # => [1, 2, 3]
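To see why the recursive one-liner needs the splat, it can help to run it without one. The sketch below uses plain integers as stand-ins for nodes, since `Integer#next` also returns the following value; `collect_between_nested` is a hypothetical name for the splat-free variant.

```ruby
# The one-liner from the answer, unchanged.
def collect_between(first, last)
  first == last ? [first] : [first, *collect_between(first.next, last)]
end

# Hypothetical variant without the splat, for comparison only.
def collect_between_nested(first, last)
  first == last ? [first] : [first, collect_between_nested(first.next, last)]
end

p collect_between(3, 6)        # => [3, 4, 5, 6]
p collect_between_nested(3, 6) # => [3, [4, [5, [6]]]]
```

Without the splat, every recursive call wraps the rest of the result in another array; with it, the rest is unrolled into the array literal, so the result stays flat.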

score 2 · Accepted Answer

# monkeypatches for Nokogiri::NodeSet
# note: versions of these functions will be in Nokogiri 1.3
class Nokogiri::XML::NodeSet
  unless method_defined?(:index)
    def index(node)
      each_with_index { |member, j| return j if member == node }
      nil # no match: return nil rather than the NodeSet itself
    end
  end

  unless method_defined?(:slice)
    def slice(start, length)
      new_set = Nokogiri::XML::NodeSet.new(self.document)
      length.times { |offset| new_set << self[start + offset] }
      new_set
    end
  end
end

#
#  solution #1: picking elements out of node children
#  NOTE that this will also include whitespacy text nodes between the <p> elements.
#
possible_matches = parent.children
start_index = possible_matches.index(@start_element)
stop_index = possible_matches.index(@end_element)
answer_1 = possible_matches.slice(start_index, stop_index - start_index + 1)

#
#  solution #2: picking elements out of a NodeSet
#  this will only include elements, not text nodes.
#
possible_matches = value.xpath("//body/*")
start_index = possible_matches.index(@start_element)
stop_index = possible_matches.index(@end_element)
answer_2 = possible_matches.slice(start_index, stop_index - start_index + 1)

score 2 · Accepted Answer

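For the sake of completeness, an XPath-only solution :)
It builds the intersection of two sets: the following siblings of the start element and the preceding siblings of the end element.

Basically, you can build an intersection with: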
  $a[count(.|$b) = count($b)]

For readability, the expression is split across a few variables:

@start_element = "//p[@id='para-3']"
@end_element = "//p[@id='para-7']"
@set_a = "#@start_element/following-sibling::*"
@set_b = "#@end_element/preceding-sibling::*"

@my_content = value.xpath("#@set_a[ count(.|#@set_b) = count(#@set_b) ]
                         | #@start_element | #@end_element")

The sibling axes do not include the element itself, so the start and end elements have to be added to the expression separately.

Edit: a simpler solution:

@start_element = "p[@id='para-3']"
@end_element = "p[@id='para-7']"
@my_content = value.xpath("//*[preceding-sibling::#@start_element and
                               following-sibling::#@end_element]
                         | //#@start_element | //#@end_element")
