apache - Apache Tika in GATE Embedded

Question

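I need to load a PDF document into my GATE Embedded application. I tried using Apache Tika to parse the PDF into a string, but GATE's ANNIE tools could not find any annotations in that string. I have heard of TikaFormat, but I can't find any examples of how to use it.

Failing that, does anyone have a working example of loading a PDF document, with TikaFormat or otherwise?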

score 1 · Accepted Answer

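I think I'm too late to answer this question, but for anyone who runs into the same problem later, here is the answer.

First, use Tika to extract the content of any file type: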

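   // Requires java.io.File, java.io.FileInputStream,
   // org.apache.tika.metadata.Metadata, org.apache.tika.sax.BodyContentHandler,
   // and AutoDetectParser, ParseContext, Parser from org.apache.tika.parser
   File file = new File("file path");
   // Parse method parameters
   Parser parser = new AutoDetectParser();
   // -1 disables the default 100,000-character write limit, so long PDFs are not cut off
   BodyContentHandler handler = new BodyContentHandler(-1);
   Metadata metadata = new Metadata();
   ParseContext context = new ParseContext();
   // Parse the file; try-with-resources closes the stream afterwards.
   // parse() can throw IOException, SAXException and TikaException.
   try (FileInputStream inputstream = new FileInputStream(file)) {
       parser.parse(inputstream, handler, metadata, context);
   }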

Then, after initializing GATE with Gate.init():

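   // Requires gate.Corpus, gate.Document, gate.Factory
   Corpus corpus = Factory.newCorpus("SegmenterCorpus");
   // handler is the Tika BodyContentHandler; toString() returns the extracted text
   Document document = Factory.newDocument(handler.toString());
   corpus.add(document);
   // pipeline is assumed to be a controller you have already loaded (for example ANNIE)
   pipeline.setCorpus(corpus);
   pipeline.execute();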

For more information on how to use Tika, have a look at the Tika tutorial. It is very useful and walks through using Tika step by step.
