
I'm trying to use Nokogiri to extract the text between two unique sets of tags.

What is the best way to get the text inside the p tag between <h2 class="point">The problem</h2> and <h2 class="point">The solution</h2>, and then all of the HTML between <h2 class="point">The solution</h2> and <div class="frame box sketh">?

Full HTML example:

<h2 class="point">The problem</h2>
<p>TEXT I WANT </p>
<h2 class="point">The solution</h2>
HTML I WANT with it's own set of tags (but never an <h2> or <div>)
<div class="frame box sketh"><img src="URL for Image I want later" alt="" /></div>

Thanks!


2 Answers

require 'nokogiri'

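# DATA is the sample HTML that follows __END__ at the bottom of this file.
doc = Nokogiri::HTML(DATA)

# Take the siblings that follow an <h2>, skipping <h2> and <div> elements
# and the "\n" text nodes that sit between the tags.
doc.search('//h2/following-sibling::node()[name() != "h2" and name() != "div" and text() != "\n"]').each do |block|
  p block.text
end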

__END__
<h2 class="point">The problem</h2>
<p>TEXT I WANT</p>
<h2 class="point">The solution</h2>
<div>dont capture this</div>
<span>HTML I WANT with it's <p>own set <b>of</b> tags</p></span>
<div class="frame box sketh"><img src="URL for Image I want later" alt="" /></div>

Output:

"TEXT I WANT"
"HTML I WANT with it's own set of tags"

This XPath selects every following sibling node of an h2 that is not itself an h2 or a div and that does not consist only of the string "\n".

answered 2012-09-18T13:58:17.193

Here is how you can get the text of the p tag between the two h2 elements that have the class point:

//h2[@class="point"][1]/following-sibling::p[./following-sibling::h2[@class="point"]]/text()

For the second part, take a look at w3schools and adapt the first expression as an example.

answered 2012-09-18T14:02:43.933