ruby - Extract a link by its link text with Nokogiri?

Question

I want to extract a specific link from a web page, using Nokogiri to search for it by its text:

<div class="links">
   <a href='http://example.org/site/1/'>site 1</a>
   <a href='http://example.org/site/2/'>site 2</a>
   <a href='http://example.org/site/3/'>site 3</a>
</div>

I want the href for "site 3", which should return:

http://example.org/site/3/

Or I want the href for "site 1", which should return:

http://example.org/site/1/

How can I do this?

score 3 · Accepted Answer

Original:

require 'nokogiri'

text = <<TEXT
<div class="links">
  <a href='http://example.org/site/1/'>site 1</a>
  <a href='http://example.org/site/2/'>site 2</a>
  <a href='http://example.org/site/3/'>site 3</a>
</div>
TEXT

link_text = "site 1"

doc = Nokogiri::HTML(text)
# Select the href attribute of the <a> whose text equals link_text
p doc.xpath("//a[text()='#{link_text}']/@href").to_s

Update:

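As far as I know, Nokogiri's XPath implementation does not support regular expressions. For a basic starts-with match there is a function called starts-with, which you can use like this (links whose text starts with "s"):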

doc = Nokogiri::HTML(text)
array_of_hrefs = doc.xpath("//a[starts-with(text(), 's')]/@href").map(&:to_s)

score 3 · Accepted Answer

Maybe you'd prefer CSS-style selectors:

doc.at('a[text()="site 1"]')[:href] # exact match
doc.at('a[text()^="site 1"]')[:href] # starts with
doc.at('a[text()*="site 1"]')[:href] # match anywhere

score 1 · Accepted Answer

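require 'nokogiri'

text = "site 1"

# DATA reads whatever follows __END__ at the bottom of the script;
# the HTML from the question is assumed to be placed there.
doc = Nokogiri::HTML(DATA)
p doc.xpath("//div[@class='links']//a[contains(text(), '#{text}')]/@href").to_s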

score 1 · Accepted Answer

Just for the record, here is another way: we can do this in Ruby with the URI module:

require 'uri'

html = %q[
<div class="links">
    <a href='http://example.org/site/1/'>site 1</a>
    <a href='http://example.org/site/2/'>site 2</a>
    <a href='http://example.org/site/3/'>site 3</a>
</div>
]

uris = Hash[URI.extract(html).map.with_index{ |u, i| [1 + i, u] }]

=> {
    1 => "http://example.org/site/1/'",
    2 => "http://example.org/site/2/'",
    3 => "http://example.org/site/3/'"
}

uris[1]
=> "http://example.org/site/1/'"

uris[3]
=> "http://example.org/site/3/'"

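Under the hood, URI.extract uses a regular expression. That is not the most reliable way to find links in a page, but it works reasonably well because a URI is usually a single string with no whitespace. Note that each extracted string above still carries the trailing ' from the href attribute. Hope it's useful.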
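The hash above is keyed by position, while the question asks for a lookup by link text. As a rough sketch of the same regex-based idea, the version below swaps URI.extract for a plain String#scan with a hand-written pattern, keys the result by link text, and leaves the closing quote out of the capture. The pattern (LINK_RE is a made-up name) only handles simple tags where href is the first attribute, so treat it as an illustration rather than a general HTML parser:

```ruby
html = %q[
<div class="links">
    <a href='http://example.org/site/1/'>site 1</a>
    <a href='http://example.org/site/2/'>site 2</a>
    <a href='http://example.org/site/3/'>site 3</a>
</div>
]

# Group 1: the opening quote, group 2: the href value, group 3: the link text.
# \1 requires the same quote character to close the attribute value.
LINK_RE = %r{<a\s+href=(['"])(.*?)\1[^>]*>(.*?)</a>}m

# Build a { link text => href } hash from the captured groups.
links = html.scan(LINK_RE).to_h { |_quote, href, text| [text.strip, href] }

p links['site 3']
p links['site 1']
```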
