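Nokogiri supports two main types of searches, search and at. search returns a NodeSet, which you should think of as an array. at returns a single Node. Either can take a CSS or an XPath expression. I prefer CSS selectors because they're more readable, but sometimes you can't easily get where you want to go with one, so try the other.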
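For your question, it's important to specify the exact node you want to extract text from before calling text. If your search is too broad, you'll get the text from between the tags in addition to the text inside the tag you want. To avoid that, drill down to the node immediately surrounding the content you want to read: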
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<release>
<artists>
<artist>
<name>Johnny Mnemonic</name>
</artist>
<artist>
<name>Constantine</name>
</artist>
</artists>
</release>
EOT
Because these look specifically for the name node, it's easy to get the desired text without any junk:
doc.at('name').text # => "Johnny Mnemonic"
doc.at('artist name').text # => "Johnny Mnemonic"
doc.at('artists artist name').text # => "Johnny Mnemonic"
These are looser searches, so they return more junk (the whitespace between the tags comes along with the name):
doc.at('artist').text # => "\n Johnny Mnemonic\n "
doc.at('artists').text # => "\n \n Johnny Mnemonic\n \n \n Constantine\n \n \n\n"
Using search returns multiple nodes:
doc.search('name').map(&:text)
[
[0] "Johnny Mnemonic",
[1] "Constantine"
]
doc.search('artist').map(&:text)
[
[0] "\n Johnny Mnemonic\n ",
[1] "\n Constantine\n "
]
The only real difference between search and at is that at behaves like search(...).first.
See also "How to avoid joining all text from Nodes when scraping".
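For convenience, Nokogiri also has some more specific variants: at_css and css, which take only CSS selectors, and at_xpath and xpath, which take only XPath expressions.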
Here are some alternative ways to get at the names, using CSS and XPath accessors, clipped from a Pry session:
[5] (pry) main: 0> # using CSS with Ruby
[6] (pry) main: 0> artists = doc.search('release').map{ |release| release.at('artist').text.strip }
[
[0] "Johnny Mnemonic",
[1] "Speed"
]
[7] (pry) main: 0> # using CSS with less Ruby
[8] (pry) main: 0> artists = doc.search('release artists artist:nth-child(1) name').map{ |n| n.text }
[
[0] "Johnny Mnemonic",
[1] "Speed"
]
[9] (pry) main: 0>
[10] (pry) main: 0> # using XPath
[11] (pry) main: 0> artists = doc.search('release/artists/artist[1]/name').map{ |t| t.content }
[
[0] "Johnny Mnemonic",
[1] "Speed"
]
[12] (pry) main: 0> # using more XPath
[13] (pry) main: 0> artists = doc.search('release/artists/artist[1]/name/text()').map{ |t| t.content }
[
[0] "Johnny Mnemonic",
[1] "Speed"
]