
I use the following code to extract text from an .odt file:

public class OpenOfficeParser {

StringBuffer TextBuffer;

public OpenOfficeParser() {}

//Process text elements recursively
public void processElement(Object o) {

    if (o instanceof Element) {

        Element e = (Element) o;
        String elementName = e.getQualifiedName();

        if (elementName.startsWith("text")) {

            if (elementName.equals("text:tab")) // add tab for text:tab
                TextBuffer.append("\\t");
            else if (elementName.equals("text:s"))  // add space for text:s
                TextBuffer.append(" ");
            else {
                List children = e.getContent();
                Iterator iterator = children.iterator();

                while (iterator.hasNext()) {

                    Object child = iterator.next();
                    //If Child is a Text Node, then append the text
                    if (child instanceof Text) { 
                        Text t = (Text) child;
                        TextBuffer.append(t.getValue());
                    }
                    else
                    processElement(child); // Recursively process the child element                   
                }                   
            }
            if (elementName.equals("text:p"))
                TextBuffer.append("\\n");                   
        }
        else {
            List non_text_list = e.getContent();
            Iterator it = non_text_list.iterator();
            while (it.hasNext()) {
                Object non_text_child = it.next();
                processElement(non_text_child);                   
            }
        }               
    }
}

public String getText(String fileName) throws Exception {
    TextBuffer = new StringBuffer();

    //Unzip the openOffice Document
    ZipFile zipFile = new ZipFile(fileName);
    Enumeration entries = zipFile.entries();
    ZipEntry entry;

    while(entries.hasMoreElements()) {
        entry = (ZipEntry) entries.nextElement();

        if (entry.getName().equals("content.xml")) {

            TextBuffer = new StringBuffer();               
            SAXBuilder sax = new SAXBuilder();
            Document doc = sax.build(zipFile.getInputStream(entry));
            Element rootElement = doc.getRootElement();
            processElement(rootElement);
            break;
        }
    }    


    System.out.println("The text extracted from the OpenOffice document = " + TextBuffer.toString());
    return TextBuffer.toString();
}
}

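My problem arises when I use the string returned by getText(). I ran the program and extracted some text from an .odt file; here is a piece of the extracted text: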

(no hi virtual x oy)\n\n house cat \n open it \n\n trying to....

So I tried this:

System.out.println( TextBuffer.toString().split("\\n")); 

The output I got was:

substring: [Ljava.lang.String;@505bb829

I also tried this:

System.out.println( TextBuffer.toString().trim() );

But the printed string did not change.

Why does this happen? What can I do to parse this string correctly? And if I want to add each substring that ends with "\n\n" to array[i], how should I do that?

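Edit: Sorry, I made a mistake in my example, because I forgot that split() returns an array. The problem is that it returns an array containing a single element, so what I am asking is why this: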

System.out.println(Arrays.toString(TextBuffer.toString().split("\\n")));

has no effect on the string I wrote in my example.

And this:

    System.out.println( TextBuffer.toString().trim() );

has no effect on the original string either; it just prints the original string.

To explain why I want to use split(): I want to parse that string and put each substring that ends with "\n" into an array of lines. Here is an example.

My original string:

    (no hi virtual x oy)\n\n house cat \n open it \n\n trying to....

After parsing, I would print each element of the array, and the output should be:

line 1: (no hi virtual x oy)
line 2: house cat
line 3: open it
line 4: trying to
and so on.

1 Answer

If I understood your question correctly, I would do something like this:

String str = "(no hi virtual x oy)\n\n house cat \n open it \n\n trying to....";

List<String> al = new ArrayList<String>(Arrays.asList(str.split("\\n")));

al.removeAll(Arrays.asList("", null)); // remove empty or null string

for (int i = 0; i< al.size(); i++) {
    System.out.println("Line " + i + " : " + al.get(i).trim());
}

Output:

Line 0 : (no hi virtual x oy)
Line 1 : house cat
Line 2 : open it
Line 3 : trying to....
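
One possible reason the same split() returns a single element on the real extracted text: the parser in the question appends "\\n" and "\\t", which in Java source are the two-character sequences backslash + n and backslash + t, not a line feed or a tab. The regex passed as "\\n" matches only a real line feed, so it finds nothing to split on. The sketch below assumes that is what the extracted string contains; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class LiteralNewlineSplit {

    // Splits on the two-character sequence backslash + 'n' (not a line feed),
    // trims each piece, and drops pieces that are empty after trimming.
    static List<String> splitOnLiteralBackslashN(String text) {
        List<String> lines = new ArrayList<>();
        // Pattern.quote makes the regex engine treat backslash + 'n' literally.
        for (String part : text.split(Pattern.quote("\\n"))) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Assumption: this mirrors what the parser produces, i.e. backslash + 'n'
        // where a real newline was intended.
        String extracted =
                "(no hi virtual x oy)\\n\\n house cat \\n open it \\n\\n trying to....";

        // The regex \n matches a real line feed, which this string does not
        // contain, so the result has exactly one element.
        System.out.println(extracted.split("\\n").length);

        List<String> lines = splitOnLiteralBackslashN(extracted);
        for (int i = 0; i < lines.size(); i++) {
            System.out.println("Line " + (i + 1) + " : " + lines.get(i));
        }
    }
}
```

If that assumption holds, a simpler fix may be to have the parser append "\n" and "\t" (single backslash) so the text contains real newlines and tabs; split("\\n") would then behave as in the answer above. Note also that trim() only strips leading and trailing whitespace, so it would not remove these sequences from the middle of the string in either case.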
Answered 2013-06-06T19:49:48.053