dom - jsoup 等价于 DOM Range 操作，例如 extractContents() 等

Question

我正在尝试通过 jsoup DOM 模型提取和替换 JavaScript DocumentFragments 的等价物。

有没有人有一些现成的代码来模拟DOM 范围选择和操作？我想选择一个文本范围，它可能会通过多个内联节点（例如 <a>、 等），在此类内联节点的中间开始或结束等。在 JavaScript 中，使用 Range 很容易操作，从其中提取 DocumentFragment，围绕它等。我猜 JavaScript Range 正在根据需要拆分内部节点，以正确处理此类提取和插入。我将如何使用 Java 中的 jsoup 执行此操作？

编辑：只是大声思考如何做到这一点 - 可能需要在我的范围内搜索“峰值”元素，然后转到范围的开始和结束，并通过跳跃将它们“提升”到“峰值水平”如果我的开始是孩子没有，则给父母。0，或者在范围开始元素之前拆分元素子列表......如果有这样的代码准备好，我宁愿重新使用它，否则将不得不从头开始编写。

2015 年 12 月 18 日更新：用我开发的工作代码发布了我的答案，见下文。

score 1 · Accepted Answer

这是我承诺的代码，用于将任意范围的 DOM 主体包装到任意 html 标记中，以便于提取、移动、替换、复制/粘贴等操作。

2015 年 12 月 19 日更新通过 wrapRange() 方法变体在文本中间添加了 TextNode 拆分，其中可选偏移到范围应该开始或结束的文本节点。现在可以在 jsoup DOM 模型中进行任意复制/粘贴/移动。

TODO：（为我自己或其他好人）

编写一个示例项目来演示这一点，加上一些测试用例，然后发布到 GitHub。现在没时间了，但似乎在我的应用程序中运行良好（处理来自网页和电子书的 HTML 代码以使用 TTS 大声朗读 - 请参阅Google Play 中的@Voice Aloud Reader 应用程序）

RangeWrapper.java 模块：

import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.parser.Tag;

import java.util.ArrayList;

/**
 * Created by greg on 12/18/2015.
 */
public class RangeWrapper {
    /**
     * Wrap the supplied HTML around the "range" from startEl to endEl.*
     * @param startEl the first element to be included into the range
     * @param endEl the last element to be included into the range
     * @param html HTML to wrap around this element, e.g. 
     * {@code <span class="head"></span>}. Can be arbitrarily deep.
     * @return the wrapping element
     */
    public static Element wrapRange(Node startEl, Node endEl, String html) {
        if (startEl == endEl) { // special case
            return (Element) startEl.wrap(html).parentNode();
        }
        int startDepth = NodeWalker.getNodeDepth(startEl);
        int endDepth = NodeWalker.getNodeDepth(endEl);
        int minDepth = getRangeMinDepth(startEl, endEl);
        int n;
        while (startDepth > minDepth) {
            Element parent = (Element)startEl.parentNode();
            if ((n = startEl.siblingIndex()) > 0) {
                // splitting the parent
                ArrayList<Node> children = new ArrayList<Node>(parent.childNodes());
                Element parent2 = new Element(Tag.valueOf(parent.tagName()), parent.baseUri(), parent.attributes());
                parent.after(parent2);
                for (int i = n; i < children.size(); i++)
                    parent2.appendChild(children.get(i));
                startEl = parent2;
            } else {
                startEl = parent;
            }
            startDepth--;
        }
        while (endDepth > minDepth) {
            Element parent = (Element)endEl.parentNode();
            if ((n = endEl.siblingIndex()) < parent.children().size()-1) {
                // splitting the parent
                ArrayList<Node> children = new ArrayList<Node>(parent.childNodes());
                Element parent2 = new Element(Tag.valueOf(parent.tagName()), parent.baseUri(), parent.attributes());
                parent.before(parent2);
                for (int i = 0; i <= n; i++)
                    parent2.appendChild(children.get(i));
                endEl = parent2;
            } else {
                endEl = parent;
            }
            endDepth--;
        }
        // Now startEl and endEl are on the same depth == minDepth. 
        // Wrap the range with our html string
        Element range = (Element) startEl.wrap(html).parentNode();
        Node nextToAppend;
        do {
            nextToAppend = range.nextSibling();
            // If nextToAppend is null, something is really wrong...
            // Commented out to let it crash and investigate,
            // so far it did not happen.
            //if (nextToAppend == null)
            //    break;
            range.appendChild(nextToAppend);
        } while (nextToAppend != endEl);

        return range;
    }

    /**
     * Wrap the supplied HTML around the "range" from startEl to endEl.*
     * @param startEl the first element to be included into the range
     * @param stOffset if startEl is TextNode, split at this offset
     *                   and include only the tail. Otherwise ignored.
     * @param endEl the last element to be included into the range
     * @param endOffset if endEl is a Text node, split at this offset
     *                    and include only the head. Otherwise ignored.
     * @param html HTML to wrap around this element, e.g. {@code <span class="head"></span>}. Can be arbitrarily deep.
     * @return the wrapping element
     */
    public static Element wrapRange(Node startEl, int stOffset, Node endEl, int endOffset, String html) {
        if (stOffset > 0 && startEl instanceof TextNode) {
            TextNode tn = (TextNode) startEl;
            if (endOffset < tn.getWholeText().length()-1) {
                startEl = tn.splitText(stOffset); // Splits tn and adds tail to DOM, returns tail
            }
        }
        if (endOffset > 0 && endEl instanceof TextNode) {
            TextNode tn = (TextNode) endEl;
            if (endOffset < tn.getWholeText().length()-1) {
                tn.splitText(stOffset); // Splits tn and adds tail to DOM, we take head == original endEl
            }
        }

        return wrapRange(startEl, endEl, html);
    }


    /**
     * Calculate the depth of the range between the two given nodes, relative to body.
     * The body has depth 0.
     * @param startNode the first element to be included into the range
     * @param endNode the last element to be included into the range
     * @return minimum depth found in the range
     */
    public static int getRangeMinDepth(final Node startNode, final Node endNode) {
        class DepthVisitor implements NodeWalker.NodeWalkVisitor {
            private int _minDepth = Integer.MAX_VALUE;
            public boolean head(Node node, int depth) {
                if (depth < _minDepth)
                    _minDepth = depth;
                return true;
            }
            public boolean tail(Node node, int depth) {return true;}
            int getMinDepth() { return _minDepth; }
        };
        DepthVisitor visitor = new DepthVisitor();
        NodeWalker nw = new NodeWalker(visitor);
        nw.walk(startNode, endNode);
        return visitor.getMinDepth();
    }
}

...以及上面代码使用的 NodeWalker.java，改编自 jsoup 包中的 NodeTraversor 和 NodeVisitor 类：

import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.select.NodeVisitor;

/**
 * Depth-first node traversor. Use to iterate through all nodes under and including the specified root node.
 * <p>
 * This implementation does not use recursion, so a deep DOM does not risk blowing the stack.
 * </p>
 */
public class NodeWalker {
    private NodeWalkVisitor visitor;

    /**
     * Create a new traversor.
     * @param visitor a class implementing the {@link NodeVisitor} interface, to be called when visiting each node.
     */
    public NodeWalker(NodeWalkVisitor visitor) {
        this.visitor = visitor;
    }

    /**
     * Start a depth-first traverse of the whole body and all of its descendants.
     * @param startNode the arbitrary start point node point within body to traverse from.
     * @param endNode the arbitrary end point node point within body where we stop traverse.
     *                Can be null, in which case we walk until the end of the body.
     */
    public void walk(Node startNode, Node endNode) {
        Node node = startNode;
        int depth = getNodeDepth(startNode); // let's calulate depth relative to body, body is depth 0

        while (node != null) {
            if (!visitor.head(node, depth))
                break;
            if (node.childNodeSize() > 0) {
                node = node.childNode(0);
                depth++;
            } else {
                while (node.nextSibling() == null && depth > 0) {
                    if (!visitor.tail(node, depth) || node == endNode)
                        return;
                    node = node.parentNode();
                    depth--;
                }
                if (!visitor.tail(node, depth) || node == endNode)
                    break;
                node = node.nextSibling();
            }
        }
    }

// The walkBack() was not needed, but leaving it here, may be useful for something...
//    /**
//     * Start a depth-first backward traverse of the whole body and all of its descendants.
//     * @param startNode the arbitrary start point node point within body to traverse from.
//     * @param endNode the arbitrary end point node point within body where we stop traverse.
//     *                Can be null, in which case we walk until the end of the body.
//     */
//    public void walkBack(Node startNode, Node endNode) {
//        Node node = startNode;
//        int depth = getNodeDepth(startNode); // let's calulate depth relative to body, body is depth 0
//
//        while (node != null) {
//            if (!visitor.tail(node, depth))
//                break;
//            if (node.childNodeSize() > 0) {
//                node = node.childNode(node.childNodeSize() - 1);
//                depth++;
//            } else {
//                while (node.previousSibling() == null && depth > 0) {
//                    if (!visitor.head(node, depth) || node == endNode)
//                        return;
//                    node = node.parentNode();
//                    depth--;
//                }
//                if (!visitor.head(node, depth) || node == endNode)
//                    break;
//                node = node.previousSibling();
//            }
//        }
//    }

    /**
     * Calculate the depth of the given node relative to body. The body has depth 0.
     * @param givenNode the node within the body to calculate depth for.
     * @return the depth of the givenNode
     */
    public static int getNodeDepth(Node givenNode) {
        Node node = givenNode;
        int depth = 0; // let's calulate depth relative to body, body is depth 0
        if (!(node instanceof Element) || !"body".equals(((Element) node).tagName())) {
            do {
                depth++;
                node = (Element)node.parentNode();
            } while (node != null && !"body".equals(((Element) node).tagName()));
        }
        return depth;
    }

    public interface NodeWalkVisitor {
        /**
         * Callback for when a node is first visited.
         *
         * @param node the node being visited.
         * @param depth the depth of the node, relative to the root node. E.g., the root node has depth 0, and a child node
         * of that will have depth 1.
         * @return true to continue walk, false to abort
         */
        boolean head(Node node, int depth);

        /**
         * Callback for when a node is last visited, after all of its descendants have been visited.
         *
         * @param node the node being visited.
         * @param depth the depth of the node, relative to the root node. E.g., the root node has depth 0, and a child node
         * of that will have depth 1.
         * @return true to continue walk, false to abort
         */
        boolean tail(Node node, int depth);
    }
}

格雷格

score 1 · Accepted Answer

两点：

JSoup 提供了一些将文本节点作为String对象进行操作的方法。
Java 及其生态系统为操作String对象提供了强大的 API。

在从头开始编写 DOM Range 操作之前，您可能会尝试使用上述两个选项找到适合自己的方式。

以下是 JSoup API 中的一些方法：

Element#text() 获取此元素的组合未编码文本作为字符串。
摘自 API：

给定 HTML Hello there now! ， p.text() 返回“Hello there now！”

Element#text(String)将此元素的当前文本替换为传递的未编码文本。

Element#ownText仅获取此元素的未编码文本，而不获取所有子元素的文本。
摘自 API：

例如，给定 HTML Hello there now!，p.ownText() 返回“Hello now！”，而 p.text() 返回“Hello there now！”。请注意，不返回 b 元素中的文本，因为它不是 p 元素的直接子元素。

您可能会发现这两个食谱也很有用：

dom - jsoup 等价于 DOM Range 操作，例如 extractContents() 等

2 回答 2

Related

Reference