ruby - Regular expression for finding href with open-uri in Ruby
Viewed 909 times
2 answers
4
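If you want to find the href attribute of a tags, use the right tool for the job, which is usually not a regular expression. You are much better off with an HTML/XML parser.
Nokogiri is the parser of choice for Ruby: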
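require 'nokogiri'
require 'open-uri'

# open-uri provides URI.open; bare Kernel#open no longer accepts URLs in Ruby 3.0+.
doc = Nokogiri::HTML(URI.open('http://www.example.org/index.html'))

# Collect the href attribute of every <a> element (nil where it is missing).
pp doc.search('a').map { |a| a['href'] }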
# => [
# => "/",
# => "/domains/",
# => "/numbers/",
# => "/protocols/",
# => "/about/",
# => "/go/rfc2606",
# => "/about/",
# => "/about/presentations/",
# => "/about/performance/",
# => "/reports/",
# => "/domains/",
# => "/domains/root/",
# => "/domains/int/",
# => "/domains/arpa/",
# => "/domains/idn-tables/",
# => "/protocols/",
# => "/numbers/",
# => "/abuse/",
# => "http://www.icann.org/",
# => "mailto:iana@iana.org?subject=General%20website%20feedback"
# => ]
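Most of the values in that list are relative paths. If absolute URLs are needed, one option is to resolve each href against the page URL with URI.join from the standard library. This is only a minimal sketch: the absolutize helper name is hypothetical, and the href list is a hand-copied subset of the output above plus a nil to stand in for an a tag without an href.

```ruby
require 'uri'

# Hypothetical helper: resolve each href against the page it came from.
# nil entries (<a> tags without an href attribute) are dropped first.
def absolutize(base_url, hrefs)
  hrefs.compact.map { |href| URI.join(base_url, href).to_s }
end

base  = 'http://www.example.org/index.html'
hrefs = ['/', '/domains/', nil, 'http://www.icann.org/']

puts absolutize(base, hrefs)
```

Hrefs that are already absolute pass through unchanged, so the same call works on a mixed list.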
answered 2012-11-13T00:18:49.763
1
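I see a few problems with this regular expression:
There does not have to be a space before the slash in an empty tag, but your regex requires one.
Your regex is very verbose and redundant.
Try the following instead, which extracts the URL from <a> tags: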
link = /<a \s # Start of tag
[^>]* # Some whitespace, other attributes, ...
href=" # Start of URL
([^"]*) # The URL, everything up to the closing quote
" # The closing quotes
/x # We stop here, as regular expressions wouldn't be able to
# correctly match nested tags anyway
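To apply the pattern to a whole page, String#scan can collect every match. The following is a rough sketch rather than a robust solution: the html string is made-up sample markup, and the pattern is the one above with only the comments reworded.

```ruby
# The pattern from the answer, in extended (/x) mode.
link = /<a \s        # Start of tag
        [^>]*        # Other attributes, if any
        href="       # Start of URL
        ([^"]*)      # The URL, up to the closing quote
        "            # The closing quote
       /x

# Made-up sample markup, only for illustration.
html = '<p><a href="/about/">About</a> and ' \
       '<a class="ext" href="http://www.icann.org/">ICANN</a></p>'

# With one capture group, scan returns an array of one-element arrays,
# so flatten it into a plain list of URLs.
urls = html.scan(link).flatten
p urls
```

Note that this only matches href values in double quotes; single-quoted or unquoted attributes are skipped, which is one more reason a parser is usually the safer choice.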
answered 2012-11-12T23:24:30.947