I am using the following code to extract text from an .odt file:
import java.util.Enumeration;
import java.util.Iterator;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Assumption: JDOM 1.x package names; with JDOM 2 these would be org.jdom2.*
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.Text;
import org.jdom.input.SAXBuilder;

public class OpenOfficeParser {
    StringBuffer TextBuffer;

    public OpenOfficeParser() {}

    // Process text elements recursively
    public void processElement(Object o) {
        if (o instanceof Element) {
            Element e = (Element) o;
            String elementName = e.getQualifiedName();
            if (elementName.startsWith("text")) {
                if (elementName.equals("text:tab")) // add tab for text:tab
                    TextBuffer.append("\\t");
                else if (elementName.equals("text:s")) // add space for text:s
                    TextBuffer.append(" ");
                else {
                    List children = e.getContent();
                    Iterator iterator = children.iterator();
                    while (iterator.hasNext()) {
                        Object child = iterator.next();
                        // If the child is a Text node, append its text
                        if (child instanceof Text) {
                            Text t = (Text) child;
                            TextBuffer.append(t.getValue());
                        }
                        else
                            processElement(child); // Recursively process the child element
                    }
                }
                if (elementName.equals("text:p"))
                    TextBuffer.append("\\n");
            }
            else {
                // Not a text element: descend into its children
                List non_text_list = e.getContent();
                Iterator it = non_text_list.iterator();
                while (it.hasNext()) {
                    Object non_text_child = it.next();
                    processElement(non_text_child);
                }
            }
        }
    }

    public String getText(String fileName) throws Exception {
        TextBuffer = new StringBuffer();
        // Unzip the OpenOffice document
        ZipFile zipFile = new ZipFile(fileName);
        Enumeration entries = zipFile.entries();
        ZipEntry entry;
        while (entries.hasMoreElements()) {
            entry = (ZipEntry) entries.nextElement();
            if (entry.getName().equals("content.xml")) {
                TextBuffer = new StringBuffer();
                SAXBuilder sax = new SAXBuilder();
                Document doc = sax.build(zipFile.getInputStream(entry));
                Element rootElement = doc.getRootElement();
                processElement(rootElement);
                break;
            }
        }
        System.out.println("The text extracted from the OpenOffice document = " + TextBuffer.toString());
        return TextBuffer.toString();
    }
}
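My problem arises when I use the string returned by the getText() method. I ran the program and extracted some text from an .odt file; here is a piece of the extracted text: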
(no hi virtual x oy)\n\n house cat \n open it \n\n trying to....
So I tried this:
System.out.println( TextBuffer.toString().split("\\n"));
The output I received was:
substring: [Ljava.lang.String;@505bb829
I also tried this:
System.out.println( TextBuffer.toString().trim() );
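But the printed string did not change.
Why does it behave this way? What can I do to parse that string correctly? And if I wanted to add each substring that ends with "\n\n" to array[i], how should I do that?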
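One possible explanation, sketched under an assumption: the extracted text seems to contain the two characters backslash and n rather than a real line feed, because processElement appends "\\n" (an escaped backslash followed by n). In Java source, the regex "\\n" passed to split() matches a real line feed, so it would find nothing to split on in such a string. The class name EscapeCheck and the helper splitCount are hypothetical names used only for this illustration.

```java
class EscapeCheck {
    // Hypothetical helper: number of pieces produced by split("\\n"),
    // where the regex \n matches a real line-feed character.
    static int splitCount(String s) {
        return s.split("\\n").length;
    }

    public static void main(String[] args) {
        // Backslash + n as two characters, which is what append("\\n") writes
        String literal = "house cat \\n open it";
        // A real line-feed character
        String real = "house cat \n open it";

        System.out.println(splitCount(literal)); // 1: nothing to split on
        System.out.println(splitCount(real));    // 2: split at the line feed

        // trim() only strips leading and trailing whitespace, so the
        // backslash and n characters are left untouched.
        System.out.println(literal.trim().equals(literal)); // true
    }
}
```

If that assumption holds, it would account for both the single-element array and the unchanged result of trim().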
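Edit: Sorry, I made a mistake in my example because I forgot that split() returns an array. The problem is that it returns an array containing a single element, so what I am really asking is why this: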
System.out.println(Arrays.toString(TextBuffer.toString().split("\\n")));
has no effect on the string I wrote in the example.
Likewise, this:
System.out.println( TextBuffer.toString().trim() );
has no effect on the original string; it just prints the original string.
To explain why I want to use split(): I want to parse that string and put each substring that ends with "\n" into an array of lines. Here is an example.
My original string:
(no hi virtual x oy)\n\n house cat \n open it \n\n trying to....
After parsing, I would print each element of the array, and the output should be:
line 1: (no hi virtual x oy)
line 2: house cat
line 3: open it
line 4: trying to
and so on.
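The desired output above could be produced along these lines. This is only a sketch, and it assumes the string really contains literal backslash-n sequences, as the sample text suggests. LiteralLineSplitter and splitLines are hypothetical names; Pattern.quote is used so that the separator is treated as plain characters rather than as a regex escape.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class LiteralLineSplitter {
    // Hypothetical helper: splits on the literal two-character sequence
    // backslash + n, trims each piece, and drops empty pieces
    // (which appear wherever two separators are adjacent).
    static List<String> splitLines(String extracted) {
        List<String> lines = new ArrayList<>();
        for (String part : extracted.split(Pattern.quote("\\n"))) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // The sample string from the question, with backslash + n as two characters
        String extracted =
            "(no hi virtual x oy)\\n\\n house cat \\n open it \\n\\n trying to";
        List<String> lines = splitLines(extracted);
        for (int i = 0; i < lines.size(); i++) {
            System.out.println("line " + (i + 1) + ": " + lines.get(i));
        }
    }
}
```

An alternative, if changing the parser is acceptable, would be to append "\n" and "\t" (single backslash) in processElement so that real line feeds and tabs are written; split("\\n") should then behave as originally expected.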