ruby - Extract text between <br> tags

Question

To extract URLs, I use the following:

html = open('http://lab/links.html')
urls = URI.extract(html)

This works well.

Now I need to extract a list of URLs that have no http or https prefix and that sit between <br > tags. Because there is no http or https prefix, URI.extract does not work.

domain1.com/index.html<br >domain2.com/home/~john/index.html<br >domain3.com/a/b/c/d/index.php

Each unprefixed URL sits between <br > tags.

~~I have been looking at a related Nokogiri XPath question about retrieving the text that follows a <br> inside a <TD>, but could not get it to work.~~

Output

domain1.com/index.html
domain2.com/home/~john/index.html
domain3.com/a/b/c/d/index.php

~~Intermediate solution~~

~~doc = Nokogiri::HTML(open("http://lab/noprefix_domains.html"))~~
~~doc.search('br').each do |n| n.replace("\n") end~~
~~puts doc~~

~~I still need to strip the rest of the HTML markup (!DOCTYPE, html, body, p)...~~

Solution

str = ""
doc.traverse { |n| str << n.to_s if (n.name == "text" or n.name == "br") }
puts str.split /\s*<\s*br\s*>\s*/

Thanks.

score 2 · Accepted Answer

Assuming you already have a way to extract the sample string shown in your question, you can call split on the string:

str = "domain1.com/index.html<br >domain2.com/home/~john/index.html<br >domain3.com/a/b/c/d/index.php"
str.split /\s*<\s*br\s*>\s*/
#=> ["domain1.com/index.html", 
#    "domain2.com/home/~john/index.html",
#    "domain3.com/a/b/c/d/index.php"]

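This splits the string at each <br> tag. It also strips any whitespace before and after the tag, and it allows whitespace inside the tag, for example <br > or < br>. If you also need to handle self-closing tags (for example <br />), use this regex instead: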

/\s*<\s*br\s*\/?\s*>\s*/
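
As a rough sketch of how that second regex could be used, the snippet below wraps it in a small helper. The helper name split_on_br, the constant BR_TAG, and the sample string that mixes <br > with <br /> are illustrative assumptions, not part of the question. The /i flag and the removal of empty strings are also additions, meant to tolerate <BR> and a leading tag.

```ruby
# Regex from the answer above; the /i flag is an added assumption so that
# upper-case <BR> tags are matched as well.
BR_TAG = /\s*<\s*br\s*\/?\s*>\s*/i

# Hypothetical helper: split a string on <br> tags and drop empty pieces,
# which appear when the string starts with a tag.
def split_on_br(str)
  str.split(BR_TAG).reject(&:empty?)
end

# Sample modelled on the question's data, with one self-closing tag added.
sample = "domain1.com/index.html<br >domain2.com/home/~john/index.html<br />domain3.com/a/b/c/d/index.php"

urls = split_on_br(sample)
puts urls
```

Because String#split already discards trailing empty strings, the reject call only matters when the input begins with a tag.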
