
Given an HTML fragment as a string like this:

<p><strong>This is a text Message.</strong></p>
<ul>
    <li>UL 1</li>
    <li><strong>UL&nbsp;</strong>2</li>
    <li><em>UL 3</em></li>
</ul>
<ol>
    <li style="font-weight: bold;"><strong>First statement</strong></li>
    <li><strong>Second&nbsp;</strong>Statement</li>
    <li>Third <strong>Statement</strong></li>
</ol>
<p>This is another <em>text </em>message.</p>

I want to format it as rich text in an Excel text box. It should look like this: [image]

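I tried writing some basic code to extract the correct values. It does not work as expected.

The code does the following:

  1. Gets all child elements from the HTML.
  2. Processes them in order and extracts the text from each.

The problem is caused by nested tags, such as <strong> inside <p>, <li>, and so on.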

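import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

    public void CreateHtmlToRichText() {
//        Expected text, one entry per line:
//
//        This is a text Message.
//
//        UL 1
//        UL 2
//        UL 3
//        First statement
//        Second Statement
//        Third Statement
//
//        This is another text message.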

        String htmlString = "<p><strong>This is a text Message.</strong></p>\n" +
                "<ul>\n" +
                "    <li>UL 1</li>\n" +
                "    <li><strong>UL&nbsp;</strong>2</li>\n" +
                "    <li><em>UL 3</em></li>\n" +
                "</ul>\n" +
                "<ol>\n" +
                "    <li style=\"font-weight: bold;\"><strong>First statement</strong></li>\n" +
                "    <li><strong>Second&nbsp;</strong>Statement</li>\n" +
                "    <li>Third <strong>Statement</strong></li>\n" +
                "</ol>\n" +
                "<p>This is another <em>text </em>message.</p>";
        System.out.println(htmlString);
        Document document = Jsoup.parse(htmlString);
        Elements elements = document.body().children().select("*");
        System.out.println("****************************");
        System.out.println(elements.size());
        Map<Integer, String> paragraphMap = new HashMap<>();
        Set<String> newLineSet = new HashSet<>();
        newLineSet.add("p");
        newLineSet.add("ol");
        newLineSet.add("ul");
        newLineSet.add("li");
        newLineSet.add("br");

        int lineNumber = 0;
        for (Element element : elements) {
            String ownText = element.ownText();
            String tagName = element.tagName();

            System.out.println("added " + lineNumber + ", " + ownText);
            if (newLineSet.contains(tagName)) {
                lineNumber++;
            }
            paragraphMap.put(lineNumber, paragraphMap.getOrDefault(lineNumber, "") + " " + ownText);

            System.out.println("*********************");
            System.out.println(element);
            System.out.println("Tag : " + tagName);
            System.out.println("Own Text : " + ownText);
            System.out.println("*********************");
        }

        System.out.println(paragraphMap);

        for (int line = 0; line <= lineNumber; line++) {
            if (paragraphMap.containsKey(line)) {
                System.out.println(paragraphMap.get(line).strip());
            }
        }
    }

Output:

This is a text Message.

UL 1
2 UL
UL 3

First statement
Statement Second
Third Statement
This is another message. text
4
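
The swapped words in the output ("2 UL", "Statement Second", "This is another message. text") are consistent with how the loop combines select("*") and ownText(): a parent element is visited before its descendants, and its own text is appended as one piece, so text that sits after a nested <strong> or <em> in the markup ends up before it. One possible way around this is to walk the markup in document order and only break lines at the block-level tags. The following is a minimal sketch of that idea, not a drop-in fix. It deliberately avoids Jsoup so that it runs on its own, and it uses a small regex tokenizer that is only adequate for simple, well-formed fragments like the one in the question. The class name DocumentOrderText and the methods toLines and flush are hypothetical names made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DocumentOrderText {
    // Tags that start a new line; the same set as newLineSet in the question.
    private static final Set<String> BLOCK_TAGS = Set.of("p", "ol", "ul", "li", "br");

    // Assumption: a simple tokenizer is enough here. Group 1 is the name of an
    // opening or closing tag, group 2 is a run of text between tags.
    private static final Pattern TOKEN =
            Pattern.compile("<\\s*/?\\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>|([^<]+)");

    // Returns the visible text as lines, keeping the order used in the markup.
    static List<String> toLines(String html) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            if (m.group(1) != null) {
                // A block-level tag (opening or closing) ends the current line.
                if (BLOCK_TAGS.contains(m.group(1).toLowerCase())) {
                    flush(current, lines);
                }
                // Inline tags such as strong and em do not break the line.
            } else {
                // Text is appended exactly where it occurs in the markup.
                // Simplification: &nbsp; is treated as a plain space.
                current.append(m.group(2).replace("&nbsp;", " "));
            }
        }
        flush(current, lines);
        return lines;
    }

    // Collapses whitespace and stores the line if anything is left.
    private static void flush(StringBuilder current, List<String> lines) {
        String line = current.toString().replaceAll("\\s+", " ").strip();
        if (!line.isEmpty()) {
            lines.add(line);
        }
        current.setLength(0);
    }

    public static void main(String[] args) {
        String htmlString = "<p><strong>This is a text Message.</strong></p>\n" +
                "<ul>\n" +
                "    <li>UL 1</li>\n" +
                "    <li><strong>UL&nbsp;</strong>2</li>\n" +
                "    <li><em>UL 3</em></li>\n" +
                "</ul>\n" +
                "<ol>\n" +
                "    <li style=\"font-weight: bold;\"><strong>First statement</strong></li>\n" +
                "    <li><strong>Second&nbsp;</strong>Statement</li>\n" +
                "    <li>Third <strong>Statement</strong></li>\n" +
                "</ol>\n" +
                "<p>This is another <em>text </em>message.</p>";
        toLines(htmlString).forEach(System.out::println);
    }
}
```

For the HTML in the question this prints the eight text lines in the intended order ("UL 2", "Second Statement", "This is another text message."). It drops the blank separator lines and carries no bold or italic information, so it only addresses the ordering problem. The same document-order walk could presumably be done over Jsoup's parsed node tree instead of a regex, which would be the more robust choice for real-world HTML.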
