nokogiri - How do I translate this Hpricot code to Nokogiri?

Question

 Hpricot(html).inner_text.gsub("\r"," ").gsub("\n"," ").split(" ").join(" ")

hpricot = Hpricot(html)
hpricot.search("script").remove
hpricot.search("link").remove
hpricot.search("meta").remove
hpricot.search("style").remove

Found it at http://www.savedmyday.com/2008/04/25/how-to-extract-text-from-html-using-rubyhpricot/

score 0 · Accepted Answer

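Nokogiri and Hpricot are interchangeable for this kind of code, i.e. Nokogiri(html) is equivalent to Hpricot(html). I'm not quite sure I understand what the linked article is trying to achieve, but to:

extract the text from the HTML body, ignoring tags and collapsing the large runs of whitespace between words,

there is a simpler approach in Hpricot that doesn't need the hpricot.search("script").remove bits: select the body first: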

Hpricot(html).search('body').inner_text.gsub("\r"," ").gsub("\n"," ").split(" ").join(" ")

In Nokogiri:

Nokogiri(html).search('body').inner_text.gsub("\r"," ").gsub("\n"," ").split(" ").join(" ")
