java - Modifying HTML in memory with JSoup

Question

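I was recently recommended JSoup for parsing and modifying HTML documents.

But suppose I have an HTML document that I want to modify (to send it, store it somewhere else, and so on). How can I do that without changing the original document?

Say I have an HTML file like this: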

<html>
 <head></head>
 <body>     
  <p></p>
  <h2>Title: title</h2>
  <p></p>
  <p>Name: </p>
  <p>Address: </p>
  <p>Phone Number: </p>
 </body>
</html>

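I want to fill in the appropriate data for the name, address, phone number, and any other information I need, without modifying the original HTML file. How can I do that with JSoup?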

score 1 · Accepted Answer

A possibly simpler solution is to modify your template so that it has placeholders, like this:

<html>
  <head></head>
  <body>     
    <p></p>
    <h2>Title: title</h2>
    <p></p>
    <p>Name: <span id="name"></span></p>
    <p>Address: <span id="address"></span></p>
    <p>Phone Number: <span id="phone"></span></p>
 </body>
</html>

Then load your document this way:

    Document doc = Jsoup.parse("" +
        "<html>\n" +
        "  <head></head>\n" +
        "  <body>     \n" +
        "    <p></p>\n" +
        "    <h2>Title: title</h2>\n" +
        "    <p></p>\n" +
        "    <p>Name: <span id=\"name\"></span></p>\n" +
        "    <p>Address: <span id=\"address\"></span></p>\n" +
        "    <p>Phone Number: <span id=\"phone\"></span></p>\n" +
        " </body>\n" +
        "</html>");

    doc.getElementById("name").text("Andrey");
    doc.getElementById("address").text("Stackoverflow.com");
    doc.getElementById("phone").text("secret!");

    System.out.println(doc.html());

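This fills in the placeholders. The changes exist only in the parsed, in-memory Document, so the original template is left untouched.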
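If the same template has to be filled several times, one option is to parse it once and fill a copy on each use. The following is a minimal sketch, assuming jsoup is on the classpath; the class name `TemplateFill` and the `fill` helper are hypothetical, and the template is shortened for brevity. It relies on `Document.clone()`, which makes a deep copy of the parsed tree.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TemplateFill {
    // Hypothetical shortened template, parsed once and kept as the pristine original.
    public static final Document TEMPLATE = Jsoup.parse(
        "<html><head></head><body>"
        + "<p>Name: <span id=\"name\"></span></p>"
        + "<p>Phone Number: <span id=\"phone\"></span></p>"
        + "</body></html>");

    // Returns the filled HTML; only the copy is modified, never TEMPLATE.
    public static String fill(String name, String phone) {
        Document copy = TEMPLATE.clone();          // deep copy of the parsed tree
        copy.getElementById("name").text(name);
        copy.getElementById("phone").text(phone);
        return copy.html();
    }

    public static void main(String[] args) {
        System.out.println(fill("Andrey", "secret!"));
        // The placeholder in the shared template is still empty afterwards.
        System.out.println(TEMPLATE.getElementById("name").text().isEmpty());
    }
}
```

Each call to `fill` starts from the unmodified template, so earlier values never leak into later output.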

score 0 · Accepted Answer

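@MarcoS has an excellent solution at https://stackoverflow.com/a/6594828/1861357 that uses NodeTraversor to build a list of the nodes to change. I only slightly modified his approach, so that each node (a set of tags) is replaced with the data in the node plus whatever information you want to add.

To keep the result in memory, I use a static StringBuilder that holds the HTML.

First we read in the HTML file (the path is specified manually and can be changed), then we run a series of checks to update any node with whatever data we want.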
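Reading the file into memory can also be done in a single call. This is a minimal sketch, assuming the template is UTF-8 encoded; `ReadTemplate`, `readHtml`, and `roundTrip` are illustrative names, and the temp file only stands in for the real template on disk. Unlike a `readLine()` loop, which drops line terminators, this keeps the line breaks of the source file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadTemplate {
    // Reads the whole file into a String; the file is only read, never written.
    public static String readHtml(Path file) {
        try {
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical demo helper: writes content to a temp file, reads it back, deletes the file.
    public static String roundTrip(String content) {
        try {
            Path tmp = Files.createTempFile("template", ".html");
            try {
                Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
                return readHtml(tmp);
            } finally {
                Files.delete(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = roundTrip("<p>Name: </p>\n<p>Address: </p>\n");
        System.out.println(html.contains("\n")); // line breaks survive the read
    }
}
```

The returned String can be handed straight to `Jsoup.parse(String)`, as the answer does with its StringBuilder contents.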

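One problem in MarcoS's solution that I did not solve is that it splits the text into individual words instead of looking at the whole line. I simply use a "-" to join multi-word labels; otherwise the string would be placed directly after the matching word.

So, a complete implementation: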

import java.util.*;
import org.jsoup.Jsoup;
import org.jsoup.nodes.*;
import org.jsoup.select.*;
import java.io.*;

public class memoryHTML
{
    static String htmlLocation = "C:\\Users\\User\\";               
    static String fileName = "blah";                            // Just for demonstration, easily modified.
    static StringBuilder buildTmpHTML = new StringBuilder();
    static StringBuilder buildHTML = new StringBuilder();
    static String name = "John Doe";
    static String address = "42 University Dr., Somewhere, Someplace";
    static String phoneNumber = "(123) 456-7890";

    public static void main(String[] args)
    {
        // You can send it the full path with the filename. I split them up because I used this for multiple files.
        readHTML(htmlLocation, fileName);
        modifyHTML();

        System.out.println(buildHTML.toString());

        // You need to clear the StringBuilder Object or it will remain in memory and build on each run.
        buildTmpHTML.setLength(0);
        buildHTML.setLength(0);

        System.exit(0);
    }

    // Simply parse and build a StringBuilder for a temporary HTML file that will be modified in modifyHTML()
    public static void readHTML(String directory, String fileName)
    {
        try
        {
            BufferedReader br = new BufferedReader(new FileReader(directory + fileName + ".html"));

            String line;
            while((line = br.readLine()) != null)
            {
                buildTmpHTML.append(line);
            }
            br.close();
        }
        catch (Exception e)
        {
            e.printStackTrace();
            System.exit(1);
        }
    }

    // Excellent method of parsing and modifying nodes in HTML files by @MarcoS at https://stackoverflow.com/a/6594828/1861357
    // It has its small problems, but it does the trick.
    public static void modifyHTML()
    {
        String htmld = buildTmpHTML.toString();
        Document doc = Jsoup.parse(htmld);

        final List<TextNode> nodesToChange = new ArrayList<TextNode>();

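        // Node.traverse(NodeVisitor) performs the same depth-first walk as NodeTraversor;
        // the NodeTraversor(visitor) constructor used originally is not available in newer jsoup releases.
        doc.body().traverse(new NodeVisitor() 
        {
          @Override
          public void tail(Node node, int depth) 
          {
            // Collect text nodes first; they are replaced after the walk finishes.
            if (node instanceof TextNode) 
            {
              TextNode textNode = (TextNode) node;
              nodesToChange.add(textNode);
            }
          }

          @Override
          public void head(Node node, int depth) 
          {        
          }
        });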

        for (TextNode textNode : nodesToChange) 
        {
          Node newNode = buildElementForText(textNode);
          textNode.replaceWith(newNode);
        }

        buildHTML.append(doc.html());
    }

    private static Node buildElementForText(TextNode textNode) 
      {
        String text = textNode.getWholeText();
        String[] words = text.trim().split(" ");
        Set<String> units = new HashSet<String>();
        for (String word : words) 
            units.add(word);

        String newText = text;
        for (String rpl : units) 
        {
            if(rpl.contains("Name"))
                newText = newText.replaceAll(rpl, "" + rpl + " " + name);
            if(rpl.contains("Address") || rpl.contains("Residence"))
                newText = newText.replaceAll(rpl, "" + rpl + " " + address);
            if(rpl.contains("Phone-Number") || rpl.contains("PhoneNumber"))
                newText = newText.replaceAll(rpl, "" + rpl + " " + phoneNumber);
        }
        // Note: newer jsoup releases take only the data argument, i.e. new DataNode(newText).
        return new DataNode(newText, textNode.baseUri());
      }
}

You will get this HTML (remember that I changed "Phone Number" to "Phone-Number"):

<html>
 <head></head>
 <body>     
  <p></p>
  <h2>Title: title</h2>
  <p></p>
  <p>Name: John Doe </p>
  <p>Address: 42 University Dr., Somewhere, Someplace</p>
  <p>Phone-Number: (123) 456-7890</p>
 </body>
</html>
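The word-splitting limitation mentioned above can be sidestepped by matching the whole label rather than individual words, so that "Phone Number" needs no hyphen. The following is a minimal sketch using plain string operations only; `LabelFill` and `fill` are hypothetical names and not part of jsoup. Avoiding `String.replaceAll` also avoids its regex handling of the pattern and its special treatment of `$` and `\` in the replacement.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LabelFill {
    // If the text starts with a known label such as "Phone Number:", insert the value
    // right after the label; otherwise return the text unchanged.
    public static String fill(String text, Map<String, String> values) {
        String trimmed = text.trim();
        for (Map.Entry<String, String> e : values.entrySet()) {
            String label = e.getKey() + ":";
            if (trimmed.startsWith(label)) {
                // Keep anything that already followed the label.
                return label + " " + e.getValue() + trimmed.substring(label.length());
            }
        }
        return text; // no known label: leave the text untouched
    }

    public static void main(String[] args) {
        Map<String, String> values = new LinkedHashMap<>();
        values.put("Name", "John Doe");
        values.put("Phone Number", "(123) 456-7890");
        System.out.println(fill("Phone Number: ", values)); // Phone Number: (123) 456-7890
        System.out.println(fill("Title: title", values));   // Title: title
    }
}
```

In the implementation above, the result could be written back with `textNode.text(...)` on each collected `TextNode` instead of building a replacement node.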
