I have some content containing a number of href links that look like this:

<p>Here you can find
    <a href="ssNODELINK/SurvivalStatistics">Survival stats </a>
    <a href="ssNODELINK/SmokingStatistics">Smoking stats </a>
    <a href="ssNODELINK/RisksAndCauses"> and Risks </a>
    <a target="_blank" href="http://www.something.ac.uk/"> Something </a>
of recent research</p>

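And a few more.

The result I want is to remove all the ssNODELINK links you see in the list and keep the other links. The result should look like this:

Here you can find Survival stats Smoking stats and Risks of recent research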

I tried the following line of code to achieve this:

page_content.gsub!(/(<a href="ssNODELINK/a-zA-Z">)/, ''))

And this one, which only removes part of it:

page_content.gsub!(/(<a href="ssNODELINK)/, '')) 

Any suggestions on how to achieve the result I want?

1 Answer

I would do it as follows:

require 'nokogiri'

doc = Nokogiri.HTML <<-eot
<p>Here you can find
    <a href="ssNODELINK/SurvivalStatistics">Survival stats </a>
    <a href="ssNODELINK/SmokingStatistics">Smoking stats </a>
    <a href="ssNODELINK/RisksAndCauses"> and Risks </a>
    <a target="_blank" href="http://www.something.ac.uk/"> Something </a>
of recent research</p>
eot

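# Collect every <a> that is a direct child of a <p>.
nodesets = doc.css('p > a')
nodesets.each do |nd|
  # to_s guards against anchors without an href (nd['href'] would be nil).
  # unlink removes the whole node, including its text, from the document.
  nd.unlink if nd['href'].to_s.include?('ssNODELINK')
end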

puts doc.to_html.gsub(/^\s*\n/, "") 
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body><p>Here you can find
# >>     <a target="_blank" href="http://www.something.ac.uk/"> Something </a>
# >> of recent research</p></body></html>
Answered 2013-11-07T19:56:38.453