ruby - Hpricot search method

Question

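I want to search within a web page: first check whether the search gives a result, and then read an attribute from it. This is the web page: link text

What I am interested in is whether a meta tag has a property attribute with the value "og:title" and, if it does, I want its content value.

If we look at the page source, it has this portion: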

<meta
property="og:title" content="Explore the Titanic Wreck Site via Social Media [EXCLUSIVE]" />

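So I want a true result for the og:title query, and the value "Explore the Titanic Wreck Site via Social Media [EXCLUSIVE]" to use in the next search. How do I do this correctly?

search("/html/head/meta[(@property='og:title']") does not return what I want.

Any suggestions?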

score 2 · Accepted Answer

Use:

/html/head/meta[@property='og:title']/@content

Answered 2010-12-03T16:43:08.000

score 1 · Accepted Answer

There is an error in your XPath, and it is also too restrictive:

search("/html/head/meta[(@property='og:title']")

should be:

search("/html/head/meta[@property='og:title']")

That fixes the error. I would simplify it to:

search("//meta[@property='og:title']")

Also, it is not very clear what you are trying to do. Do you want to find

<meta 
  property="og:title" 
  content="Explore the Titanic Wreck Site via Social Media [EXCLUSIVE]" 
 />

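and extract the content parameter? Or do you want to locate the tag, confirm that it contains both the "og:title" property and the "Explore the Titanic Wreck Site via Social Media [EXCLUSIVE]" content, and then do further processing?

That said, it is usually simpler to use CSS accessors instead of XPath. I prefer Nokogiri, which supports both XPath and CSS selectors; I am using CSS below: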

require 'nokogiri'
require 'open-uri'

# On Ruby 3.0+ open-uri no longer lets Kernel#open fetch URLs; use URI.open.
doc = Nokogiri::HTML(URI.open('http://mashable.com/2010/08/06/expedition-titanic'))
(doc % 'meta[property="og:title"]')
=> #<Nokogiri::XML::Element:0x8084ee48 name="meta" attributes=[#<Nokogiri::XML::Attr:0x8084ed58 name="property" value="og:title">, #<Nokogiri::XML::Attr:0x8084ed1c name="content" value="Explore the Titanic Wreck Site via Social Media [EXCLUSIVE]">]>

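Nokogiri and Hpricot support the / and % shorthands for search and at, respectively. search returns an array of all matches; at returns only the first match. So the example above uses CSS to grab the first node, which shows this is the right track. I am not sure how to use CSS to match two parameters in the same tag, so I will use search to track down all the <meta> tags with property="og:title", then filter on the content= parameter: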

(doc / 'meta[property="og:title"]').select{ |n| n['content'][/titanic wreck site/i] }
=> [#<Nokogiri::XML::Element:0x8084ee48 name="meta" attributes=[#<Nokogiri::XML::Attr:0x8084ed58 name="property" value="og:title">, #<Nokogiri::XML::Attr:0x8084ed1c name="content" value="Explore the Titanic Wreck Site via Social Media [EXCLUSIVE]">]>]

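At that point we have the right node in the returned array, so you can extract whatever you want from it, or dive into its child nodes and sack and pillage. To do that, you need to get the actual node with .first or [0] for further processing: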

(doc / 'meta[property="og:title"]').select{ |n| n['content'][/titanic wreck site/i] }.first

Update based on the OP's response, still using Nokogiri:

>> meta = (doc % 'meta[@property="og:title"]')['content']
>> meta #=> "Explore the Titanic Wreck Site via Social Media [EXCLUSIVE]"

score 1 · Accepted Answer

Thanks for the answers. When I posted my question I did not realize I had an error in my search. It was Friday evening...

The correct search is

elements = @doc.search("/html/head/meta[@property='og:title']")

That is, the ( character before @property was removed from the expression.

This gives:

elements = <meta property="og:title" content="Explore the Titanic Wreck Site via Social Media [EXCLUSIVE]" />

as the result. Then I check whether I got anything, and if I did, I need the content value:

if elements.nil?
  puts 'not found'
elsif elements.size > 0
  puts "Found one, og:title = #{elements}"
  content = elements.attr("content")
  puts content # this will display the content (it will be processed)
else
  # can the flow control come here? - theoretically yes, but in practice?
end
