ruby - Using Nokogiri to grab text with formatting and link tags

Question

score 2 · Accepted Answer

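I would use a two-step strategy: Nokogiri to extract the content you want, then a blacklist/whitelist program to strip the tags you don't want or keep the ones you do.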

require 'nokogiri'
require 'sanitize'

html = '
<div id="1">
  This is text in the TD with <strong> strong <strong> tags
  <p>This is a child node. with <b> bold </b> tags</p>
  <div id=2>
      "another line of text to a <a href="link.html"> link </a>"
      <p> This is text inside a div <em>inside<em> another div inside a paragraph tag</p>
  </div>
</div>
'

doc = Nokogiri.HTML(html)
html_fragment = doc.at('div#1').to_html

This captures the contents of <div id="1"> as an HTML string:

      This is text in the TD with <strong> strong <strong> tags
      <p>This is a child node. with <b> bold </b> tags</p>
      <div id="2">
          "another line of text to a <a href="link.html"> link </a>"
          <p> This is text inside a div <em>inside<em> another div inside a paragraph tag</em></em></p>
      </div>
    </strong></strong>

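The trailing </strong></strong> is the result of the two opening <strong> tags. That may have been intentional, but because there are no closing tags, Nokogiri does some fix-up to make the HTML valid.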
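To see why exactly two closing tags get appended, it can help to count openers and closers for an element before parsing. The following is a minimal sketch using only the Ruby standard library; `tag_balance` is a hypothetical helper name, and the regex matching is a rough check rather than real HTML parsing.

```ruby
# Count opening and closing tags for one element name in an HTML string.
# Hypothetical helper for illustration: regex-based, not a real HTML parser.
def tag_balance(html, name)
  escaped = Regexp.escape(name)
  # An opener is "<name" followed by whitespace or ">".
  opens  = html.scan(/<#{escaped}(?=[\s>])/i).size
  # A closer is "</name>" with optional whitespace before ">".
  closes = html.scan(%r{</#{escaped}\s*>}i).size
  { opens: opens, closes: closes, unclosed: opens - closes }
end

sample = 'This is text in the TD with <strong> strong <strong> tags'
p tag_balance(sample, 'strong')
```

For the sample line, the helper reports two openers and no closers, which matches the two </strong> tags the parser adds at the end.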

Pass html_fragment to the Sanitize gem:

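# Sanitize 3.0 and later expose this call as Sanitize.fragment;
# releases before 3.0 named it Sanitize.clean.
doc = Sanitize.fragment(
  html_fragment,
  :elements   => %w[ a b em strong ],  # tags to keep
  :attributes => {
    'a'    => %w[ href ],              # attributes to keep on <a>
  },
)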

The returned text looks like this:

 This is text in the TD with <strong> strong <strong> tags
  This is a child node. with <b> bold </b> tags 

      "another line of text to a <a href="link.html"> link </a>"
        This is text inside a div <em>inside<em> another div inside a paragraph tag</em></em> 

</strong></strong>

Again, because the HTML is malformed, with no closing </strong> tags, the two trailing closing tags are present.
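For intuition about what the whitelist step does, here is a rough sketch in plain Ruby with no gems. It is not how Sanitize works internally and is not safe for untrusted input: `keep_allowed_tags` and `ALLOWED` are hypothetical names, the regex only recognizes simple tags, and attributes on kept tags are passed through unfiltered, whereas the Sanitize call above also restricts attributes.

```ruby
# Element names to keep, mirroring the :elements list used above.
ALLOWED = %w[a b em strong].freeze

# Remove every tag whose name is not in the allowed list, keeping its text.
# Illustration only: attributes on kept tags are left untouched.
def keep_allowed_tags(html, allowed = ALLOWED)
  html.gsub(%r{</?([a-zA-Z][a-zA-Z0-9]*)[^>]*>}) do |tag|
    allowed.include?(Regexp.last_match(1).downcase) ? tag : ''
  end
end

puts keep_allowed_tags('<p>A <b>bold</b> <a href="link.html">link</a></p>')
```

The <p> wrapper is dropped while <b> and <a> survive, which is the same keep-or-strip decision the whitelist makes for each element.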
