html - How to parse link tags from a user-entered URL in Rails

Question

I want to be able to check whether the page at a URL entered by the user contains something like the following:

<link rel="alternate" type="application/rss+xml" ... href="http://feeds.example.com/MyBlog"/>

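That would let me rule out one of the options when working out the Atom or RSS feed URL.

Is there a good way to do this? Does my server have to fetch the entire HTML of the user's URL and go through all of it?

I need the URL in a variable once it has been parsed.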

score 2 · Accepted Answer

You can use the Nokogiri gem - http://www.nokogiri.org/

Here is an example using its CSS-style document search syntax:

require 'nokogiri'
require 'open-uri'

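# Fetch and parse the page (URI.open is provided by open-uri;
# a bare open() no longer accepts URLs on Ruby 3.0+)
document = Nokogiri::HTML(URI.open('http://www.example.com/'))
# Select the <link> tags that advertise an RSS feed
rss_xml_nodes = document.css('link[rel="alternate"][type="application/rss+xml"]')
# Pull the href attribute out of each matching node
rss_xml_hrefs = rss_xml_nodes.collect { |node| node[:href] }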

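rss_xml_nodes will hold a Nokogiri::XML::NodeSet, an array-like collection of the matching elements.

rss_xml_hrefs will hold an array of strings containing the href attribute of each node.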

rss_xml_nodes.empty?
=> false

rss_xml_hrefs
=> ["http://www.example.com/rss-feed.xml", "http://www.example.com/rss-feed2.xml"]
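Since the question asks for the URL in a variable, here is a minimal sketch of that last step. It assumes the hrefs may be relative or missing, so it resolves each one against the page URL with URI.join from the standard library. The helper name absolute_feed_urls and the sample values are hypothetical, not part of the answer above.

```ruby
require 'uri'

# Hypothetical helper: resolve each href against the URL of the page it
# came from, skipping nodes that had no href attribute (nil entries).
def absolute_feed_urls(page_url, hrefs)
  hrefs.compact.map { |href| URI.join(page_url, href).to_s }
end

# Sample values for illustration only.
page_url = 'http://www.example.com/blog/'
hrefs = ['http://www.example.com/rss-feed.xml', '/rss-feed2.xml', nil]

feed_urls = absolute_feed_urls(page_url, hrefs)
feed_url = feed_urls.first # nil when the page advertises no feed
```

In the example above you would pass rss_xml_hrefs as the second argument; feed_url then holds the first advertised feed, or nil if there is none.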

score 0 · Accepted Answer

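I believe you do have to parse everything, because the only way to get any part of the page is to fetch all of it in a single HTTP request. You can use Ruby's Net::HTTP class for this, as follows: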

require 'net/http'

url = URI.parse('http://www.example.com/index.html')
req = Net::HTTP::Get.new(url.path)
res = Net::HTTP.start(url.host, url.port) {|http|
  http.request(req)
}

# regex below grabs all the hrefs on link tags
# print all the matches
res.body.scan(/<link[^>]*href\s*=\s*["']([^"']*)/).each {|match| 
  puts match
}
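Note that the regex above matches every link tag, stylesheets included. A rough sketch of narrowing it to feed links follows, assuming simple, well-formed markup with quoted attributes. The helper name feed_hrefs and the list of feed types are my own assumptions, and a real HTML parser will be more robust than this.

```ruby
# Hypothetical helper: return the href of every <link> tag whose type
# attribute names a feed format. Regex-based, so it only copes with
# simple markup where attribute values are quoted.
def feed_hrefs(html)
  # Assumed list of feed MIME types; extend as needed.
  feed_types = ['application/rss+xml', 'application/atom+xml']

  html.scan(/<link\b[^>]*>/i).map do |tag|
    type = tag[/\btype\s*=\s*["']([^"']*)["']/i, 1]
    href = tag[/\bhref\s*=\s*["']([^"']*)["']/i, 1]
    href if href && type && feed_types.include?(type.downcase)
  end.compact
end

# Inline sample instead of a live request, for illustration only.
html = <<~HTML
  <link rel="stylesheet" type="text/css" href="/site.css"/>
  <link rel="alternate" type="application/rss+xml" href="http://feeds.example.com/MyBlog"/>
HTML

feed_url = feed_hrefs(html).first # nil when no feed link is present
```

With the request above, you would call feed_hrefs(res.body) instead of passing the inline sample.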
