html - Flattening HTML documents using Jsoup

Question

HTML documents are hierarchical and can be parsed into a tree with Jsoup's DOM parser.

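Is there a way to extract semantic "sections" from these documents using pattern matching, where each match marks the start of a "section" and the end of the previous one, and where sections can have subsections, nested without limit?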

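The main difficulty is that the HTML text at the start of a "section" is not necessarily valid HTML on its own (for example, when the section start is nested inside other tags). The desired output is a traversal that extracts all the HTML content of a "section" and of its direct "children" (subsections).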

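Note that the problem can be reduced to extracting the content between two HTML tags (the start of the section, inclusive, and the start of the next section, exclusive): even if the pattern matches some arbitrary text in the document, the first HTML tag enclosing that text can be used instead.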

Is there any way of doing this in Jsoup, i.e. given 2 Nodes to extract the HTML in between, irrespective of the hierarchical (nesting) level they belong to?
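
To make the reduction concrete, here is a minimal sketch of the idea rather than a Jsoup solution. It assumes well-formed markup and uses the JDK's built-in `org.w3c.dom` parser in place of Jsoup; the class `Between` and its methods are hypothetical names. In a document-order (pre-order) walk, everything visited from the start node up to, but not including, the end node lies "between" them, whatever their nesting depth. The sketch collects text only; rebuilding the markup in between would additionally require re-opening the enclosing tags.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical helper: text between two nodes, ignoring nesting depth. */
class Between {

    /** Next node in document (pre-order) order, or null at the end of the tree. */
    static Node next(Node n) {
        if (n.getFirstChild() != null) {
            return n.getFirstChild();
        }
        // No children: climb until some ancestor has a following sibling.
        while (n != null && n.getNextSibling() == null) {
            n = n.getParentNode();
        }
        return n == null ? null : n.getNextSibling();
    }

    /** Text from start (inclusive) up to end (exclusive); end == null means "to the end". */
    static String textBetween(Node start, Node end) {
        StringBuilder sb = new StringBuilder();
        for (Node n = start; n != null && !n.isSameNode(end); n = next(n)) {
            if (n.getNodeType() == Node.TEXT_NODE) {
                String t = n.getNodeValue().trim();
                if (!t.isEmpty()) {
                    sb.append(sb.length() > 0 ? " " : "").append(t);
                }
            }
        }
        return sb.toString();
    }

    /** Parses a well-formed (XML-compatible) fragment; real-world HTML would need Jsoup. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed: " + xml, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<body><h1>Title</h1>intro<div><h2>Intro</h2>text <em>here</em></div>"
                + "<div><h2>Next</h2><p>more</p></div></body>");
        Node h1 = doc.getElementsByTagName("h1").item(0);
        Node firstH2 = doc.getElementsByTagName("h2").item(0);
        Node secondH2 = doc.getElementsByTagName("h2").item(1);
        System.out.println(textBetween(h1, firstH2));       // section started by <h1>
        System.out.println(textBetween(firstH2, secondH2)); // section started by first <h2>
    }
}
```

The same walk should carry over to Jsoup's node tree, since only "first child", "next sibling" and "parent" are needed.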

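The following example uses tag matching to delineate the semantic "sections", restricted to HTML headers (e.g. <h1>) for simplicity. The hierarchy of "sections" is: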

{Flattening HTML Documents [
    {Introduction},
    {Methodology [
      {Recursion [
        {First Approach}, {Second Approach}]
      },
      {Tree Traversal [
        {Depth-First Search}, {Breadth-First Search}]
      }]
    },
    {Conclusion}]
}

Here is the original HTML.

<html>
  <head><title>Flattening HTML Documents</title></head>
  <body>
<h1>Flattening HTML Documents</h1>
    The requirement is to read each document in memory and extract its "sections",<br/>
    in sequential order, keeping track of subsections, in a tree-like manner.
    <div>
      <h2>Introduction</h2>
      Flattening HTML documents using <em><u>predefined</u> tag</em> values<br/>
      to mark the start of a section, which is also the end of the previous section.
    </div>
    <div>
      <h2>Methodology</h2>
      <p>What would be the optimal way of doing this?</p>
      <ul>
        <li>
          <h3>Recursion</h3>
          One method is <strong>recursion</strong>. But how do we keep state (section limits)?
          <ul>
            <li><h4>First Approach</h4><p>Pass state via method arguments</p></li>
            <li><h4>Second Approach</h4><p>Pass state via method return values</p></li>
          </ul>
          <p>There are also <strong>tree-based</strong> methods.</p>
        </li>
        <li>
          <h3>Tree traversal</h3>
          Another method is <strong>tree traversal</strong>. But how do we keep state (section limits)?
          <ol>
            <li><h4>Depth-First Search</h4><p>Options: <b>preorder</b>, <b>inorder</b>, <b>postorder</b></p></li>
            <li><h4>Breadth-First Search</h4><p>Just <b>BFS</b>.</p></li>
          </ol>
        </li>
      </ul>
    </div>
    <div>
      <h2>Conclusion</h2>
      <p>Flattening (shredding) an <strong>HTML</strong> document using predefined tags<br/>
      (e.g., HTML header tags like &lt;h1&gt;) is a fascinating problem.</p>
    </div>  
  </body>
</html>

score 1 · Accepted Answer

I believe this can be handled, to some extent, by a CSS selector such as this one:

:has(:is(h1,h2,h3,h4,h5) ~ p)

This returns a flat list of the elements that contain a heading together with a sibling paragraph.

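If you want to maintain the hierarchical layout in your application model, you have to iterate recursively over the results above and run the same selector on each of them (to get their inner sections). Alternatively, simply walk the tree to see which element is the parent of which.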
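
As an alternative to re-running the selector recursively, the hierarchy can also be rebuilt from the headings alone. The following is a minimal sketch under stated assumptions, not a definitive implementation: `SectionTree`, `Section` and `build` are hypothetical names, and the input is assumed to be a list of (tag name, title) pairs in document order, which with Jsoup could presumably be taken from something like `doc.select("h1, h2, h3, h4")`. A stack holds the chain of currently open sections; each new heading closes every open section at the same or a deeper level and becomes a child of whatever remains on top.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical helper: builds a section tree from headings listed in document order. */
class SectionTree {

    /** One section: heading level (1 for h1, 2 for h2, ...), title, nested subsections. */
    static final class Section {
        final int level;
        final String title;
        final List<Section> children = new ArrayList<>();

        Section(int level, String title) {
            this.level = level;
            this.title = title;
        }

        @Override
        public String toString() {
            return children.isEmpty() ? "{" + title + "}" : "{" + title + " " + children + "}";
        }
    }

    /** headings: each entry is {tagName, title}, e.g. {"h2", "Introduction"}. */
    static Section build(List<String[]> headings) {
        Section root = new Section(0, "root");     // synthetic root above every h1
        Deque<Section> open = new ArrayDeque<>();  // chain of currently open sections
        open.push(root);
        for (String[] h : headings) {
            int level = Integer.parseInt(h[0].substring(1)); // "h3" -> 3
            // A heading closes every open section at the same or a deeper level.
            while (open.peek().level >= level) {
                open.pop();
            }
            Section section = new Section(level, h[1]);
            open.peek().children.add(section);
            open.push(section);
        }
        return root;
    }

    public static void main(String[] args) {
        // Headings of the example document from the question, in document order.
        List<String[]> headings = List.of(
                new String[] {"h1", "Flattening HTML Documents"},
                new String[] {"h2", "Introduction"},
                new String[] {"h2", "Methodology"},
                new String[] {"h3", "Recursion"},
                new String[] {"h4", "First Approach"},
                new String[] {"h4", "Second Approach"},
                new String[] {"h3", "Tree Traversal"},
                new String[] {"h4", "Depth-First Search"},
                new String[] {"h4", "Breadth-First Search"},
                new String[] {"h2", "Conclusion"});
        System.out.println(build(headings).children.get(0));
    }
}
```

This only recovers the nesting of the headings; attaching each section's content still requires collecting the nodes between one heading and the next.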
