I have a document to parse that contains HTML, and I want to convert it from HTML to plain text while keeping some formatting.

Example excerpt:

<p>My simple paragragh</p>
<p>My paragragh with <a>Link</a></p>
<p>My paragragh with an <img/></p>

I can handle this simple example easily (though perhaps not very efficiently):

StringBuilder sb = new StringBuilder();

for(Element element : doc.getAllElements()){
    if(element.tag().getName().equals("p")){
        sb.append(element.text());
        sb.append("\n\n");
    }
}

Is it possible (and how would I do it) to insert output for inline elements at the correct position? An example:

<p>My paragragh with <a>Link</a> in the middle</p> 

would become:

My paragragh with (Location: http://mylink.com) in the middle

1 Answer


You can replace each link tag with a TextNode:

final String html = "<p>My simple paragragh</p>\n"
        + "<p>My paragragh with <a>Link</a></p>\n"
        + "<p>My paragragh with an <img/></p>";

Document doc = Jsoup.parse(html, "");

// Select all link-tags and replace them with TextNodes
for( Element element : doc.select("a") )
{
    element.replaceWith(new TextNode("(Location: http://mylink.com)", ""));
}


StringBuilder sb = new StringBuilder();

// Format as needed
for( Element element : doc.select("*") )
{
    // An alternative to the 'if'-statement
    switch(element.tagName())
    {
        case "p":
            sb.append(element.text()).append("\n\n");
            break;
        // Maybe you have to format some other tags here too ...
    }
}

System.out.println(sb);
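
The snippet above hardcodes the URL, and the sample links have no href. If your real links do carry an href attribute, something like the following may be closer to what you want. This is only a sketch: the class name HtmlToText and the method toPlainText are made up for illustration, the href in the sample input is an assumption, and the single-argument TextNode constructor assumes jsoup 1.11.1 or later (older versions need the two-argument form used above).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

public class HtmlToText {

    // Hypothetical helper: converts each <p> to a plain-text paragraph,
    // replacing every link that has an href with "(Location: <href>)".
    static String toPlainText(String html) {
        Document doc = Jsoup.parse(html);

        // Swap each <a href="..."> for a TextNode carrying its target URL.
        // Links without an href are left alone, so their text is kept as-is.
        for (Element link : doc.select("a[href]")) {
            String replacement = "(Location: " + link.attr("href") + ")";
            link.replaceWith(new TextNode(replacement)); // assumes jsoup >= 1.11.1
        }

        // Emit one paragraph per <p>, separated by a blank line.
        StringBuilder sb = new StringBuilder();
        for (Element p : doc.select("p")) {
            sb.append(p.text()).append("\n\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The href value here is an assumption; the question's sample has none.
        String html = "<p>My paragragh with "
                + "<a href=\"http://mylink.com\">Link</a> in the middle</p>";
        System.out.print(toPlainText(html));
    }
}
```

Because the replacement happens in the DOM before text() is called, the location text ends up at the position the link occupied inside the paragraph.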
answered 2013-10-29T19:38:20.530