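I'd combine two strategies: use Nokogiri to extract the content you want, then run it through a blacklist/whitelist sanitizer to strip the tags you don't want or keep only the ones you do.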
require 'nokogiri'
require 'sanitize'
html = '
<div id="1">
This is text in the TD with <strong> strong <strong> tags
<p>This is a child node. with <b> bold </b> tags</p>
<div id=2>
"another line of text to a <a href="link.html"> link </a>"
<p> This is text inside a div <em>inside<em> another div inside a paragraph tag</p>
</div>
</div>
'
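# Parse the (deliberately malformed) HTML; Nokogiri repairs unclosed tags.
doc = Nokogiri::HTML(html)
# Select the outer div by its id and serialize it back to an HTML string.
html_fragment = doc.at('div#1').to_html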
This captures the contents of <div id="1"> as an HTML string:
This is text in the TD with <strong> strong <strong> tags
<p>This is a child node. with <b> bold </b> tags</p>
<div id="2">
"another line of text to a <a href="link.html"> link </a>"
<p> This is text inside a div <em>inside<em> another div inside a paragraph tag</em></em></p>
</div>
</strong></strong>
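The trailing </strong></strong> is the result of the two opening <strong> tags in the input. That may have been intentional, but because the closing tags are missing, Nokogiri does some fix-up to make the HTML well-formed.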
Pass html_fragment to the Sanitize gem:
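# Whitelist: keep only these inline elements, and only href on <a> tags.
# Everything else is stripped, leaving its text content behind.
# Note: newer Sanitize releases (3.0+) rename this call to Sanitize.fragment.
doc = Sanitize.clean(
html_fragment,
:elements => %w[ a b em strong ],
:attributes => {
'a' => %w[ href ],
},
)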
The returned text looks like this:
This is text in the TD with <strong> strong <strong> tags
This is a child node. with <b> bold </b> tags
"another line of text to a <a href="link.html"> link </a>"
This is text inside a div <em>inside<em> another div inside a paragraph tag</em></em>
</strong></strong>
Again, because the HTML is malformed and has no closing </strong> tags, the output ends with two trailing closing tags.
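To make the whitelist idea concrete without any gems, here is a minimal sketch in plain Ruby. It is not how Sanitize works internally. It swaps in a simple regular expression, which cannot parse real-world HTML reliably and, unlike the Sanitize call above, does not filter attributes. The names ALLOWED_TAGS and strip_unlisted_tags are hypothetical and exist only for this illustration. For untrusted input, stick with a real sanitizer.

```ruby
# Illustrative only: a regex is not an HTML parser. Use a real sanitizer
# (such as the Sanitize gem) for anything that handles untrusted input.
# ALLOWED_TAGS and strip_unlisted_tags are hypothetical names for this sketch.
ALLOWED_TAGS = %w[a b em strong].freeze

# Remove every tag whose name is not in the whitelist, keeping the text
# between the tags. Attributes on allowed tags are left untouched.
def strip_unlisted_tags(html, allowed = ALLOWED_TAGS)
  html.gsub(%r{</?([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>}) do |tag|
    allowed.include?(Regexp.last_match(1).downcase) ? tag : ''
  end
end

sample = '<div><p>Keep <b>bold</b> and <em>emphasis</em>, drop the wrappers.</p></div>'
puts strip_unlisted_tags(sample)
# prints: Keep <b>bold</b> and <em>emphasis</em>, drop the wrappers.
```

A blacklist would invert the test inside the block and drop only the listed tags. A whitelist is usually the safer default, because any tag you did not anticipate is removed rather than passed through.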
answered 2013-01-09T00:30:37.673