java - Parsing document structure with Java

Question

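We need to extract a tree structure from a given text document using Java. The file type should be common and open (rtf, odt, ...). We currently use Apache Tika to parse plain text from a range of documents.

Which file type and API should we use to parse the correct structure most reliably? If Tika can do this, I would be glad to see a demonstration.

For example, we should get this kind of data from a given document: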

Main Heading
  Heading 1
    Heading 1.1
  Heading 2
    Heading 2.2

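The main heading is the title of the paper. The paper has two top-level headings, Heading 1 and Heading 2, and each of them has one subheading. We should also get the content (paragraph text) under each heading.

Any help is appreciated.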

score 3 · Accepted Answer

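An OpenDocument (.odt) file is actually a zip package containing several XML files. content.xml holds the actual text content of the document. We are interested in the headings, which can be found in text:h elements.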
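As a rough illustration of that package layout, the sketch below builds a tiny zip in memory and lists its entries. The class name OdtPackageSketch and its helper methods are made up for this example, and the fake package is not a valid ODT document; a real .odt contains more entries (for example META-INF/manifest.xml).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class OdtPackageSketch {

    /** Builds a tiny stand-in for an .odt package in memory (not a valid document). */
    static byte[] buildFakeOdt() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            put(zip, "mimetype", "application/vnd.oasis.opendocument.text");
            put(zip, "content.xml", "<office:document-content/>");
            put(zip, "styles.xml", "<office:document-styles/>");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The zip stream is closed here, so the archive is complete
        return bytes.toByteArray();
    }

    private static void put(ZipOutputStream zip, String name, String body) throws IOException {
        zip.putNextEntry(new ZipEntry(name));
        zip.write(body.getBytes(StandardCharsets.UTF_8));
        zip.closeEntry();
    }

    /** Lists the entry names of a zip package given as bytes, in archive order. */
    static List<String> listEntries(byte[] zipBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            for (ZipEntry entry; (entry = zip.getNextEntry()) != null; ) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        // prints [mimetype, content.xml, styles.xml]
        System.out.println(listEntries(buildFakeOdt()));
    }
}
```

Running the same listing against a real .odt file should show content.xml among the entries.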

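I found an implementation that uses QueryPath to extract headings from .odt files.

Since the original question was about Java, here is a Java version. First, we open content.xml using ZipFile. Then we parse the XML in content.xml with SAX. The sample code simply prints all the headings; for Test3.odt the output is: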

Test3.odt
content.xml
3764
1 My New Great Paper
2 Abstract
2 Introduction
2 Content
3 More content
3 Even more
2 Conclusions



Sample code:

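    import java.io.File;
    import java.io.InputStream;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.InputSource;
    import org.xml.sax.XMLReader;

    public class OdtDocumentStructureExtractor {

        public void printHeadingsOfOdtFile(File odtFile) {
            // try-with-resources closes the zip package even if parsing fails
            try (ZipFile zFile = new ZipFile(odtFile)) {
                System.out.println(zFile.getName());

                ZipEntry contentFile = zFile.getEntry("content.xml");
                if (contentFile == null) {
                    System.err.println("No content.xml found in " + odtFile);
                    return;
                }
                System.out.println(contentFile.getName());
                System.out.println(contentFile.getSize());

                // XMLReaderFactory is deprecated since Java 9; this parser is not
                // namespace-aware by default, so the handler sees qNames such as "text:h"
                XMLReader xr = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
                xr.setContentHandler(new OdtDocumentContentHandler());
                try (InputStream in = zFile.getInputStream(contentFile)) {
                    xr.parse(new InputSource(in));
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        public static void main(String[] args) {
            new OdtDocumentStructureExtractor().printHeadingsOfOdtFile(new File("Test3.odt"));
        }
    }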


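The relevant parts of the ContentHandler used above look like this:

    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class OdtDocumentContentHandler extends DefaultHandler {

        private final StringBuilder fullText = new StringBuilder();    // all character data
        private final StringBuilder headingText = new StringBuilder(); // text of the current heading
        private boolean inHeading;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("text:h".equals(qName)) {
                inHeading = true;
                headingText.setLength(0);
                String headingLevel = atts.getValue("text:outline-level");
                if (headingLevel != null) {
                    System.out.print(headingLevel + " ");
                }
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // SAX may deliver one text node in several chunks, so append rather than overwrite
            fullText.append(ch, start, length);
            if (inHeading) {
                headingText.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("text:h".equals(qName)) {
                System.out.println(headingText);
                inHeading = false;
            }
        }
    }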
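The question asked for a tree with the paragraph text under each heading, not a flat list. One possible way to get there is to keep a stack of open headings while SAX reports text:h and text:p elements. The following is only a sketch under assumptions: the names HeadingTreeSketch, Node and TreeHandler are made up, a missing text:outline-level is treated as level 1, the input is a hand-written snippet and not a real content.xml, and nested paragraphs (for example in footnotes) are not handled. It uses a namespace-aware parser so that it matches on the ODF text namespace instead of the "text:" prefix.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class HeadingTreeSketch {
    // Namespace bound to the "text" prefix in ODF content.xml
    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    /** Hypothetical node type: a heading, its paragraphs and its sub-headings. */
    static class Node {
        final int level;
        final String title;
        final List<String> paragraphs = new ArrayList<>();
        final List<Node> children = new ArrayList<>();
        Node(int level, String title) { this.level = level; this.title = title; }
    }

    static class TreeHandler extends DefaultHandler {
        final Node root = new Node(0, "(document)");
        private final Deque<Node> open = new ArrayDeque<>(); // path from root to current heading
        private final StringBuilder text = new StringBuilder();
        private int level;
        { open.push(root); }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            boolean heading = TEXT_NS.equals(uri) && "h".equals(localName);
            boolean paragraph = TEXT_NS.equals(uri) && "p".equals(localName);
            if (heading) {
                String value = atts.getValue(TEXT_NS, "outline-level");
                level = (value == null) ? 1 : Integer.parseInt(value); // assumption: missing means 1
            }
            if (heading || paragraph) {
                text.setLength(0); // start collecting this element's text
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (!TEXT_NS.equals(uri)) {
                return;
            }
            String content = text.toString().trim();
            if ("h".equals(localName)) {
                // Close headings at the same or a deeper level, then attach to the remaining parent
                while (open.size() > 1 && open.peek().level >= level) {
                    open.pop();
                }
                Node node = new Node(level, content);
                open.peek().children.add(node);
                open.push(node);
            } else if ("p".equals(localName) && !content.isEmpty()) {
                open.peek().paragraphs.add(content);
            }
        }
    }

    /** Parses content.xml-style markup and returns the synthetic root of the heading tree. */
    static Node parse(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // report uri/localName instead of relying on prefixes
            TreeHandler handler = new TreeHandler();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.root;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    static void print(Node node, String indent) {
        for (Node child : node.children) {
            System.out.println(indent + child.title + " " + child.paragraphs);
            print(child, indent + "  ");
        }
    }

    public static void main(String[] args) {
        String xml = "<doc xmlns:text='" + TEXT_NS + "'>"
                + "<text:h text:outline-level='1'>Main Heading</text:h><text:p>Intro</text:p>"
                + "<text:h text:outline-level='2'>Heading 1</text:h>"
                + "<text:h text:outline-level='3'>Heading 1.1</text:h><text:p>Details</text:p>"
                + "<text:h text:outline-level='2'>Heading 2</text:h></doc>";
        print(parse(xml), "");
    }
}
```

For the snippet in main, Heading 1 and Heading 2 end up as children of Main Heading, and Heading 1.1 as a child of Heading 1, each printed with the paragraphs collected under it. To use this on a real document, the stream for content.xml obtained from the ZipFile would be passed to the parser in place of the StringReader.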
