ruby - Regex for finding href with open-uri in Ruby

Question

score 4 · Accepted Answer

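If you want to find the href attribute of a tags, use the right tool, which usually isn't a regular expression. You should most likely be using an HTML/XML parser.

Nokogiri is the parser of choice for Ruby: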

require 'nokogiri'
require 'open-uri'

# URI.open is provided by open-uri; Kernel#open no longer accepts URLs as of Ruby 3.0.
doc = Nokogiri::HTML(URI.open('http://www.example.org/index.html'))
doc.search('a').map { |a| a['href'] }

pp doc.search('a').map { |a| a['href'] }
# => [
# =>  "/",
# =>  "/domains/",
# =>  "/numbers/",
# =>  "/protocols/",
# =>  "/about/",
# =>  "/go/rfc2606",
# =>  "/about/",
# =>  "/about/presentations/",
# =>  "/about/performance/",
# =>  "/reports/",
# =>  "/domains/",
# =>  "/domains/root/",
# =>  "/domains/int/",
# =>  "/domains/arpa/",
# =>  "/domains/idn-tables/",
# =>  "/protocols/",
# =>  "/numbers/",
# =>  "/abuse/",
# =>  "http://www.icann.org/",
# =>  "mailto:iana@iana.org?subject=General%20website%20feedback"
# => ]
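
Most of the hrefs in that output are relative paths. If absolute URLs are needed, one possible approach is to resolve each value against the page URL with URI.join from the standard library. This is only a sketch: the `base` and `hrefs` names are illustrative, and the list is a hand-copied subset of the output above rather than a live fetch.

```ruby
require 'uri'

# Assumed base: the page URL used in the example above.
base = 'http://www.example.org/index.html'

# A small subset of the hrefs shown above, hard-coded for illustration.
hrefs = ['/', '/domains/', 'http://www.icann.org/']

# URI.join resolves relative references against the base and
# leaves URLs that are already absolute unchanged.
absolute = hrefs.map { |href| URI.join(base, href).to_s }

puts absolute
```

Under these assumptions, the relative paths gain the `http://www.example.org` prefix while the ICANN link passes through untouched.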

score 1 · Accepted Answer

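I see a few problems with this regular expression:

There doesn't necessarily have to be a space before the slash in an empty tag, but your regex requires one
Your regex is very verbose and redundant

Try the following, which extracts the URL from an <a> tag: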

link = /<a \s   # Start of tag
    [^>]*       # Some whitespace, other attributes, ...
    href="      # Start of URL
    ([^"]*)     # The URL, everything up to the closing quote
    "           # The closing quotes
    /x          # We stop here, as regular expressions wouldn't be able to
                # correctly match nested tags anyway
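
As a rough usage sketch, a pattern like this can be applied with String#scan. The sample HTML and the `LINK`, `html`, and `hrefs` names below are made up for illustration. The third link uses single quotes, which a pattern written for double quotes will skip, and that kind of gap is one reason a parser is usually the safer choice.

```ruby
# The same pattern as above, repeated here so the snippet runs on its own.
LINK = /<a \s [^>]* href=" ([^"]*) "/x

# Made-up sample markup, not taken from a real page.
html = <<~HTML
  <p><a href="/domains/">Domains</a>
  <a class="ext" href="http://www.icann.org/">ICANN</a>
  <a href='/single/'>single quotes</a></p>
HTML

# scan returns one array of capture groups per match; flatten gives the URLs.
hrefs = html.scan(LINK).flatten

p hrefs
# => ["/domains/", "http://www.icann.org/"]
```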
