
I'm trying to parse an RTF file using Apache Tika. The file contains a table with several columns.

The problem is that the parser writes out the result without any indication of which column each value came from.

What I'm doing right now is:

    AutoDetectParser adp = new AutoDetectParser(tc);
    Metadata metadata = new Metadata();
    String mimeType = new Tika().detect(file);
    metadata.set(Metadata.CONTENT_TYPE, mimeType);
    BodyContentHandler handler = new BodyContentHandler();

    InputStream fis = new FileInputStream(file);

    adp.parse(fis, handler, metadata, new ParseContext());

    fis.close();
    System.out.println(handler.toString());

It works, but I also need structural information, such as which column each value belongs to.

Is there an existing handler that outputs something like HTML, preserving the structure of the parsed RTF file?


1 Answer

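I'd suggest that, rather than asking Tika for the plain-text version and then wondering where all the nice HTML information has gone, you just ask Tika for the document as XHTML. You'll then be able to process that to find the information you want in your RTF file.

If you look at the Tika examples or the Tika unit tests, you'll see the same pattern, which is a simple way to get XHTML output: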

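    // "parser" and "input" are whatever you already have, e.g. the
    // AutoDetectParser and the InputStream from the question.
    Metadata metadata = new Metadata();

    // Serialize the SAX events Tika emits into a string of XHTML.
    StringWriter sw = new StringWriter();
    SAXTransformerFactory factory = (SAXTransformerFactory)
             SAXTransformerFactory.newInstance();
    TransformerHandler handler = factory.newTransformerHandler();
    handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
    handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
    handler.setResult(new StreamResult(sw));

    // Pass the TransformerHandler in place of BodyContentHandler.
    parser.parse(input, handler, metadata, new ParseContext());

    String xhtml = sw.toString();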
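
Once you have the XHTML string, processing it to recover the column of each value could look roughly like the sketch below. It uses only the JDK's SAX parser, and it assumes the XHTML contains ordinary `table`/`tr`/`td` markup for your RTF table, which you should check against your own output. The class name `XhtmlTableCells` and the method `extract` are made up for this illustration and are not part of Tika.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika: lists table cells with their position.
public class XhtmlTableCells {

    /** Returns one "row,column: text" entry per table cell, with zero-based indices. */
    public static List<String> extract(String xhtml) throws Exception {
        final List<String> cells = new ArrayList<>();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // XHTML elements live in a namespace

        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), new DefaultHandler() {
            private int row = -1;
            private int col = -1;
            private StringBuilder text; // non-null only while inside a cell

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("tr".equals(local)) {
                    row++;      // new row: advance the row index
                    col = -1;   // and restart column counting
                } else if ("td".equals(local) || "th".equals(local)) {
                    col++;
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (("td".equals(local) || "th".equals(local)) && text != null) {
                    cells.add(row + "," + col + ": " + text.toString().trim());
                    text = null;
                }
            }
        });
        return cells;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the "xhtml" string produced above (assumed shape).
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body><table>"
                + "<tr><td>Name</td><td>Qty</td></tr>"
                + "<tr><td>Bolt</td><td>12</td></tr>"
                + "</table></body></html>";
        for (String cell : extract(xhtml)) {
            System.out.println(cell);
        }
    }
}
```

This sketch only handles flat tables: nested tables would confuse the counters, and row numbers keep increasing across multiple tables in the same document.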
answered 2012-04-16T15:50:06.420