apache-tika - Tika: extracting the separate items from a compound document

Question

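Question: Suppose there is an email message with an attachment (say, a JPEG). How can I parse the email (without using the Tika facade class) and get back its separate parts: a) the email's text content and b) the email's attachment?

Configuration: Tika 1.2, Java 1.7

Details: I have been able to correctly parse emails in basic email formats. After parsing, however, I need to know a) the text content of the email and b) the content of any attachments. I will store these items in my database, essentially as a parent email with child attachments.

What I cannot figure out is how to "get back" the different parts, know that the parent email has attachments, and store those attachments separately with a reference to the email. I believe this is essentially similar to extracting the contents of a ZipFile.

Code sample: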

private Message processDocument(String fullfilepath) {
    try {
        File filename = new File(fullfilepath);
        return this.processDocument(filename);
    } catch (NullPointerException npe) {
        Message error = new Message(false);
        error.appendErrorMessage("The file name was null.");
        return error;
    }
}

private Message processDocument(File filename) {
    InputStream stream = null;
    try {
        stream = new FileInputStream(filename);
    } catch (FileNotFoundException fnfe) {
        // TODO Auto-generated catch block
        fnfe.printStackTrace();
        System.out.println("FileNotFoundException");
        return diag;
    }

    int writelimit = -1;
    ContentHandler texthandler = new BodyContentHandler(writelimit);
    this.safehandlerbodytext = new SafeContentHandler(texthandler);
    this.meta = new Metadata();
    ParseContext context = new ParseContext();

    AutoDetectParser autodetectparser = new AutoDetectParser();

    try {
        autodetectparser.parse(
            stream,
            texthandler,
            meta,
            context);

        this.documenttype = meta.get("Content-Type");

        diag.setSuccessful(true);

    } catch (IOException ioe) {
        // if the document stream could not be read
        System.out.println("TikaTextExtractorHelper IOException " + ioe.getMessage());
        //FIXME -- add real handling

    } catch (SAXException se) {
        // if the SAX events could not be processed
        System.out.println("TikaTextExtractorHelper SAXException " + se.getMessage());
        //FIXME -- add real handling

    } catch (TikaException te) {
        // if the document could not be parsed
        System.out.println("TikaTextExtractorHelper TikaException " + te.getMessage());
        System.out.println("Exception Filename = " + filename.getName());
        //FIXME -- add real handling

    }

    // diag is presumably a Message field of the enclosing class (its declaration is not shown)
    return diag;
}

score 1 · Accepted Answer

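When Tika encounters an embedded document, it goes to the ParseContext to see whether you have supplied a recursing parser. If you have, it will use that parser to process any embedded resources. If you have not, it skips them.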
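The lookup just described can be pictured with a stripped-down stand-in. This is not Tika's code: `SimpleContext`, `PartHandler` and `processEmbedded` are hypothetical names, used only to illustrate the idea of a context keyed by class, where embedded parts are handed over if something is registered and skipped otherwise.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ContextLookupSketch {

    /** Hypothetical stand-in for whatever handles an embedded part. */
    interface PartHandler {
        void handle(String partName);
    }

    /** Hypothetical context that stores at most one value per class key. */
    static class SimpleContext {
        private final Map<Class<?>, Object> entries = new HashMap<Class<?>, Object>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key) {
            // Class.cast passes null through, so a missing entry yields null
            return key.cast(entries.get(key));
        }
    }

    /** Hands each part to the registered handler; returns how many were handled. */
    static int processEmbedded(List<String> parts, SimpleContext context) {
        PartHandler handler = context.get(PartHandler.class);
        if (handler == null) {
            return 0; // nothing registered, so the embedded parts are skipped
        }
        for (String part : parts) {
            handler.handle(part);
        }
        return parts.size();
    }

    public static void main(String[] args) {
        List<String> parts = Arrays.asList("photo.jpg", "notes.pdf");

        // No handler registered: the parts are skipped
        System.out.println(processEmbedded(parts, new SimpleContext()));

        // Handler registered: every part is passed to it
        final List<String> seen = new ArrayList<String>();
        SimpleContext context = new SimpleContext();
        context.set(PartHandler.class, new PartHandler() {
            public void handle(String partName) {
                seen.add(partName);
            }
        });
        System.out.println(processEmbedded(parts, context));
        System.out.println(seen);
    }
}
```

The point of the sketch is only the branch in `processEmbedded`: the caller decides what happens to embedded parts by what it registers in the context before parsing starts.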

So, what you probably want to do is something like:

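// Imports assume Tika 1.x; org.apache.commons.io.IOUtils would work equally well for the copy
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.tika.exception.TikaException;
import org.apache.tika.io.IOUtils;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AbstractParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public static class HandleEmbeddedParser extends AbstractParser {
    // One temp file per embedded document handed to this parser
    public List<File> found = new ArrayList<File>();

    public Set<MediaType> getSupportedTypes(ParseContext context) {
        // Return what you want to handle
        Set<MediaType> types = new HashSet<MediaType>();
        types.add(MediaType.application("pdf"));
        types.add(MediaType.application("zip"));
        return types;
    }

    public void parse(
            InputStream stream, ContentHandler handler,
            Metadata metadata, ParseContext context
    ) throws IOException, SAXException, TikaException {
        // Do something with the child documents
        // eg save to disk
        File f = File.createTempFile("tika", "tmp");
        found.add(f);

        FileOutputStream fout = new FileOutputStream(f);
        try {
            IOUtils.copy(stream, fout);
        } finally {
            fout.close();
        }
    }
}

ParseContext context = new ParseContext();
context.set(Parser.class, new HandleEmbeddedParser());
parser.parse(....);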
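To tie the collected children back to the parent email, as the question asks, a small parent/child record is one option. The following is a minimal sketch only: `ParentEmail` and `Attachment` are hypothetical names, not Tika classes, and the assumption is that the body text comes from the content handler after parsing while each attachment is recorded from inside the embedded parser's `parse` method.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class ParentChildSketch {

    /** Hypothetical record for one extracted attachment. */
    static class Attachment {
        final String name;
        final String mediaType;
        final byte[] content;

        Attachment(String name, String mediaType, byte[] content) {
            this.name = name;
            this.mediaType = mediaType;
            this.content = content;
        }
    }

    /** Hypothetical record for the parent email and its child attachments. */
    static class ParentEmail {
        final String bodyText;
        private final List<Attachment> attachments = new ArrayList<Attachment>();

        ParentEmail(String bodyText) {
            this.bodyText = bodyText;
        }

        void addAttachment(Attachment attachment) {
            attachments.add(attachment);
        }

        boolean hasAttachments() {
            return !attachments.isEmpty();
        }

        List<Attachment> getAttachments() {
            // Read-only view so callers cannot detach children by accident
            return Collections.unmodifiableList(attachments);
        }
    }

    public static void main(String[] args) {
        ParentEmail email = new ParentEmail("Hello, photo attached.");
        email.addAttachment(new Attachment("photo.jpg", "image/jpeg", new byte[] {1, 2, 3}));

        System.out.println(email.hasAttachments());
        for (Attachment attachment : email.getAttachments()) {
            System.out.println(attachment.name + " (" + attachment.mediaType + ")");
        }
    }
}
```

Keeping the children in a list owned by the parent maps directly onto a parent row with child rows in a database, which is the storage layout the question describes.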
