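I think you need to traverse the tree. Calling text() on an element returns all of that element's text, including the text inside its child elements. Hopefully the following code helps: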
import java.io.File;
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.commons.io.FileUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class ScreenScrape {

    public static void main(String[] args) throws IOException {
        // Assumes test.html is UTF-8 encoded; adjust the charset name if it is not.
        String content = FileUtils.readFileToString(new File("test.html"), "UTF-8");
        Document doc = Jsoup.parse(content);
        Element body = doc.body();
        StringBuilder sb = new StringBuilder();
        traverse(body, sb);
        System.out.println(sb.toString());
    }

    // Depth-first walk that re-serializes the tree, rewriting text nodes on the way.
    private static void traverse(Node n, StringBuilder sb) {
        if (n instanceof Element) {
            sb.append('<');
            sb.append(n.nodeName());
            if (n.attributes().size() > 0) {
                // Attributes.toString() already starts with a space.
                sb.append(n.attributes().toString());
            }
            sb.append('>');
        }
        if (n instanceof TextNode) {
            TextNode tn = (TextNode) n;
            if (!tn.isBlank()) {
                sb.append(spanifyText(tn.text()));
            }
        }
        for (Node c : n.childNodes()) {
            traverse(c, sb);
        }
        if (n instanceof Element) {
            // Note: void elements such as <br> also get a closing tag here.
            sb.append("</");
            sb.append(n.nodeName());
            sb.append('>');
        }
    }

    // Wraps every whitespace-separated token longer than 3 characters in a <span>.
    private static String spanifyText(String text) {
        StringBuilder sb = new StringBuilder();
        StringTokenizer st = new StringTokenizer(text);
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (token.length() > 3) {
                sb.append("<span>");
                sb.append(token);
                sb.append("</span>");
            } else {
                sb.append(token);
            }
            sb.append(' ');
        }
        // Drop the trailing space; the guard avoids an exception when there are no tokens.
        return sb.length() == 0 ? "" : sb.substring(0, sb.length() - 1);
    }
}
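To make the word rule in spanifyText concrete, here is a minimal stand-alone sketch with no jsoup dependency. The class name SpanifyDemo and the use of StringJoiner are my own choices rather than part of the code above; joining with a StringJoiner means there is no trailing space to trim, so empty input needs no special case.

```java
import java.util.StringJoiner;
import java.util.StringTokenizer;

public class SpanifyDemo {

    // Wraps every whitespace-separated token longer than 3 characters in <span>.
    public static String spanify(String text) {
        StringJoiner joined = new StringJoiner(" ");
        StringTokenizer st = new StringTokenizer(text);
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            joined.add(token.length() > 3 ? "<span>" + token + "</span>" : token);
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        // prints: The <span>quick</span> fox ran <span>away</span>
        System.out.println(spanify("The quick fox ran away"));
    }
}
```

As in the original, runs of whitespace between words collapse to a single space, and the tokens are not HTML-escaped.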
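Update
Using Jonathan's new jsoup method List<TextNode> Element.textNodes(), combined with the NodeTraversor/NodeVisitor technique that MarcoS suggested, I came up with the following (although I modify the tree while traversing it, which is probably a bad idea):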
// Additional imports: java.util.ArrayList, java.util.List, org.jsoup.parser.Tag,
// org.jsoup.select.NodeTraversor, org.jsoup.select.NodeVisitor
Document doc = Jsoup.parse(content);
Element body = doc.body();
// Newer jsoup releases offer a static NodeTraversor.traverse(visitor, root)
// in place of this constructor.
NodeTraversor nd = new NodeTraversor(new NodeVisitor() {
    @Override
    public void tail(Node node, int depth) {
        if (!(node instanceof Element)) {
            return;
        }
        Element elem = (Element) node;
        for (TextNode tn : elem.textNodes()) {
            boolean foundLongWord = false;
            List<Node> changedNodes = new ArrayList<Node>();
            StringTokenizer st = new StringTokenizer(tn.text());
            while (st.hasMoreTokens()) {
                String token = st.nextToken();
                if (token.length() > 3) {
                    foundLongWord = true;
                    Element span = new Element(Tag.valueOf("span"), elem.baseUri());
                    span.appendText(token);
                    changedNodes.add(span);
                    // Keep a separator after the wrapped word so it does not
                    // run into the next token.
                    changedNodes.add(new TextNode(" ", elem.baseUri()));
                } else {
                    changedNodes.add(new TextNode(token + " ", elem.baseUri()));
                }
            }
            // Only touch the tree if at least one word was wrapped.
            if (foundLongWord) {
                Node currentNode = changedNodes.remove(0);
                tn.replaceWith(currentNode);
                for (Node n : changedNodes) {
                    currentNode.after(n);
                    currentNode = n;
                }
            }
        }
    }

    @Override
    public void head(Node node, int depth) {
        // Nothing to do on the way down.
    }
});
nd.traverse(body);
System.out.println(body.toString());
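On the worry about modifying the tree while traversing it: one possible alternative is to work in two passes, first collecting the text nodes and only then replacing them. The sketch below is not the approach used above. It assumes jsoup 1.11.1 or later (for the single-argument TextNode constructor), and the class name TwoPassSpanify and the sample HTML are made up for illustration. It also keeps a leading or trailing space of each text node so that words next to inline tags do not run together.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

public class TwoPassSpanify {

    public static String spanifyBody(String html) {
        Document doc = Jsoup.parse(html);
        doc.outputSettings().prettyPrint(false);
        Element body = doc.body();

        // Pass 1: snapshot every non-blank text node before touching the tree.
        List<TextNode> targets = new ArrayList<>();
        for (Element el : body.getAllElements()) {
            for (TextNode tn : el.textNodes()) {
                if (!tn.isBlank()) {
                    targets.add(tn);
                }
            }
        }

        // Pass 2: insert the replacement nodes before each original, then remove it.
        for (TextNode tn : targets) {
            String raw = tn.getWholeText();
            if (Character.isWhitespace(raw.charAt(0))) {
                tn.before(new TextNode(" "));
            }
            StringTokenizer st = new StringTokenizer(raw);
            while (st.hasMoreTokens()) {
                String token = st.nextToken();
                if (token.length() > 3) {
                    tn.before(doc.createElement("span").text(token));
                } else {
                    tn.before(new TextNode(token));
                }
                if (st.hasMoreTokens()) {
                    tn.before(new TextNode(" "));
                }
            }
            if (Character.isWhitespace(raw.charAt(raw.length() - 1))) {
                tn.before(new TextNode(" "));
            }
            tn.remove();
        }
        return body.html();
    }

    public static void main(String[] args) {
        System.out.println(spanifyBody("<p>This is <b>some bold</b> text</p>"));
    }
}
```

Because the list of targets is fixed before any node is replaced, the newly created span elements are never visited, and the traversal order no longer matters.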