
I have an HTML document structured like this:

<li class="indent1">(something)
  <li class="indent2">(something else)</li>
  <li class="indent2">(something else)
    <li class="indent3">(another sublevel)</li>
  </li>
  <li class="indent2">(something else)</li>
</li>

I need to wrap these LI tags in OL tags. There are many such lists throughout the document. The resulting HTML should look like this:

<ol>
  <li>(something)
    <ol>
      <li>(something else)</li>
      <li>(something else)
        <ol>
          <li>(another sublevel)</li>
        </ol>
      </li>
      <li>(something else)</li>
    </ol>
  </li>
</ol>

How can I do this with Nokogiri? Thanks in advance.

Edit:

Here is a sample of the HTML from the original document. My script converts all of the P tags to LI tags.

  <p class="indent1"><i>a.</i> This regulation describes the Army Planning, Programming,
  Budgeting, and Execution System (PPBES). It explains how an integrated Secretariat and
  Army Staff, with the full participation of major Army commands (MACOMs), Program
  Executive Offices (PEOs), and other operating agencies--</p>

  <p class="indent2">(1) Plan, program, budget, and then allocate and manage approved
  resources.</p>

  <p class="indent2">(2) Provide the commanders in chief (CINCs) of United States unified
  and specified commands with the best mix of Army forces, equipment, and support
  attainable within available resources.</p>

  <p class="indent1"><i>b.</i> The regulation assigns responsibilities and describes
  policy and procedures for using the PPBES to:</p>

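The indent1 class denotes a first-level list item, indent2 a second-level item, and so on. I need to convert these indent classes into properly nested ordered lists.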


2 Answers


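The following solution works by visiting each <li> in the document and doing one of two things:

  • If no <ol> immediately precedes the <li>, replace the <li> with a new <ol> and then move the <li> inside it.
  • If an <ol> immediately precedes the <li>, move the <li> into that <ol>.
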
document.css('li').each do |li|
  if li.at_xpath('preceding-sibling::node()[not(self::text()[not(normalize-space())])][1][self::ol]')
    li.previous_element << li
  else
    li.replace('<ol/>').first << li
  end
end
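
To make the XPath test above easier to read, here is a rough plain-Ruby equivalent. It is only a sketch: `preceding_ol` is a hypothetical helper name (not part of Nokogiri), and it assumes the node responds to `previous_sibling`, `text?`, `element?`, `text`, and `name` the way Nokogiri nodes do. Ruby's `strip` is also slightly more liberal about what counts as whitespace than XPath's `normalize-space()`.

```ruby
# Return the <ol> that immediately precedes `li`, ignoring whitespace-only
# text nodes in between; return nil when anything else comes first.
# Hypothetical helper; relies only on duck-typed node methods.
def preceding_ol(li)
  node = li.previous_sibling
  # Skip text nodes that contain nothing but whitespace
  node = node.previous_sibling while node && node.text? && node.text.strip.empty?
  node if node && node.element? && node.name == 'ol'
end
```

With such a helper, the condition in the loop could presumably be written as `if (ol = preceding_ol(li))` followed by `ol << li`.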

Here it is, tested:

require 'nokogiri'

# Use XML instead of HTML fragment due to problems with XPath
fragment = Nokogiri::XML.fragment '
  <li>List 1
    <li>List 1a</li>
    <li>List 1b
      <li>List 1bi</li>
    </li>
    <li>List 1c</li>
    New List
    <li>New List 1a</li>
  </li>
  <p>Break 1</p>
  <li>List 2a</li>
  <li>List 2b</li>
  <p>Break 2</p>
  <li>List 3 <li>List 3a</li></li>
'

fragment.css('li').each do |li|
  # Complex test to see if the preceding element is an <ol> and there's no non-empty text between the li and it
  # See http://stackoverflow.com/q/14045519/405017
  if li.at_xpath('preceding-sibling::node()[not(self::text()[not(normalize-space())])][1][self::ol]')
    li.previous_element << li
  else
    li.replace('<ol/>').first << li
  end
end

puts fragment   # I've normalized the whitespace in the output to make it clear
#=> <ol>
#=>   <li>List 1
#=>     <ol>
#=>       <li>List 1a</li>
#=>       <li>List 1b
#=>         <ol>
#=>           <li>List 1bi</li>
#=>         </ol>
#=>       </li>
#=>       <li>List 1c</li>
#=>     </ol>
#=>     New List
#=>     <ol><li>New List 1a</li></ol>
#=>   </li>
#=> </ol>
#=> <p>Break 1</p>
#=> <ol>
#=>   <li>List 2a</li>
#=>   <li>List 2b</li>
#=> </ol>
#=> <p>Break 2</p>
#=> <ol>
#=>   <li>List 3
#=>     <ol>
#=>       <li>List 3a</li>
#=>     </ol>
#=>   </li>
#=> </ol>
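
For the flat `indent1`/`indent2` paragraphs shown in the question's edit, the nesting can also be derived from the class level directly rather than from already-nested `<li>` tags. The sketch below is one possible approach, not part of the tested solution above: `nest_items` is a hypothetical helper, it takes `[level, html]` pairs that you would first have to extract yourself (for example from each paragraph's class name), it assumes the inner HTML is already escaped, and it rejects input that jumps more than one level deeper at a time.

```ruby
# Build nested <ol> markup from flat [level, html] pairs.
# Hypothetical helper; levels start at 1 and may deepen by one step at most.
def nest_items(items)
  out = []
  depth = 0
  items.each do |level, html|
    raise ArgumentError, "bad level #{level} after #{depth}" if level < 1 || level > depth + 1
    if level > depth
      out << '<ol>'                                   # descend: open a sublist
    else
      out << '</li>'                                  # close the previous item
      (depth - level).times { out << '</ol></li>' }   # climb back out of deeper lists
    end
    out << "<li>#{html}"
    depth = level
  end
  depth.times { out << '</li></ol>' }                 # close whatever is still open
  out.join
end

items = [[1, 'a.'], [2, '(1)'], [2, '(2)'], [1, 'b.']]
puts nest_items(items)
#=> <ol><li>a.<ol><li>(1)</li><li>(2)</li></ol></li><li>b.</li></ol>
```

The resulting string could then be parsed back into the document, which sidesteps the question of how a parser treats `<li>` elements nested directly inside one another.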
answered 2012-12-26T16:03:12.027

The problem is that your HTML is malformed. You cannot parse it successfully with Nokogiri.

answered 2012-12-27T11:09:00.500