java - Getting the MimeType subtype with Apache Tika

Question

For documents such as odt, ppt, pptx and xlsx, I need to get the iana.org MediaType rather than application/zip or application/x-tika-msoffice.

If you look at mimetypes.xml, a mime-type element consists of the iana.org mime-type and a "sub-class-of" element:

<mime-type type="application/msword">
    <alias type="application/vnd.ms-word"/>
    ............................
    <glob pattern="*.doc"/>
    <glob pattern="*.dot"/>
    <sub-class-of type="application/x-tika-msoffice"/>
</mime-type>

How can I get the iana.org mime-type name instead of the parent type name?

When testing mime type detection, I do:

MediaType mediaType = MediaType.parse(tika.detect(inputStream));
String mimeType = mediaType.getSubtype();

Test results:

FAILED: getsCorrectContentType("application/vnd.ms-excel", docs/xls/en.xls)
java.lang.AssertionError: expected:<application/vnd.ms-excel> but was:<x-tika-msoffice>

FAILED: getsCorrectContentType("vnd.openxmlformats-officedocument.spreadsheetml.sheet", docs/xlsx/en.xlsx)
java.lang.AssertionError: expected:<vnd.openxmlformats-officedocument.spreadsheetml.sheet> but was:<zip>

FAILED: getsCorrectContentType("application/msword", doc/en.doc)
java.lang.AssertionError: expected:<application/msword> but was:<x-tika-msoffice>

FAILED: getsCorrectContentType("application/vnd.openxmlformats-officedocument.wordprocessingml.document", docs/docx/en.docx)
java.lang.AssertionError: expected:<application/vnd.openxmlformats-officedocument.wordprocessingml.document> but was:<zip>

FAILED: getsCorrectContentType("vnd.ms-powerpoint", docs/ppt/en.ppt)
java.lang.AssertionError: expected:<vnd.ms-powerpoint> but was:<x-tika-msoffice>

Is there a way to get the actual subtype from mimetypes.xml, instead of x-tika-msoffice or application/zip?

Also, I never get application/x-tika-ooxml; for xlsx, docx and pptx documents I get application/zip instead.

score 31 · Accepted Answer

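Originally, Tika only supported detection by mime magic or by file extension (glob), since that is all that most mime detection before Tika did.

Because mime magic and globs have problems when it comes to detecting container formats, it was decided to add some new detectors to Tika to handle these. The container-aware detectors take the whole file, open and process the container, and then work out the exact file type based on the contents. Initially you had to call them explicitly, but they were later wrapped up in ContainerAwareDetector, which you will see in some of the other answers.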
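To make the limitation concrete, here is a minimal, stdlib-only sketch of magic-byte sniffing. It is not Tika's code; the class name MagicSniffSketch and its sniff method are made up for illustration. It checks only the well-known OLE2 and ZIP signatures, which is why every legacy Office file comes back as application/x-tika-msoffice and every OOXML or ODF file as application/zip:

```java
final class MagicSniffSketch {
    // OLE2 compound document signature: legacy .doc, .xls and .ppt all start with it.
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local file header signature "PK\3\4": .docx, .xlsx, .pptx and .odt all start with it.
    private static final byte[] ZIP = {'P', 'K', 3, 4};

    // Returns the most specific type that the leading bytes alone can justify.
    static String sniff(byte[] head) {
        if (startsWith(head, OLE2)) {
            return "application/x-tika-msoffice";
        }
        if (startsWith(head, ZIP)) {
            return "application/zip";
        }
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Only the first bytes matter here, so these stand in for real files.
        byte[] xlsxHead = {'P', 'K', 3, 4, 20, 0};
        byte[] docHead = {
            (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
            (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1, 0, 0
        };
        System.out.println(sniff(xlsxHead)); // application/zip
        System.out.println(sniff(docHead));  // application/x-tika-msoffice
    }
}
```

Nothing in the first few bytes distinguishes a spreadsheet from a word-processing document, so anything more precise has to look inside the container.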

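Since then, Tika has added a service loader pattern, initially for Parsers. This allows classes to be loaded automatically when they are present, with a general way to identify which ones are appropriate and use them. This support was later extended to cover detectors as well, at which point the old ContainerAwareDetector could be removed in favour of something cleaner.

If you are on Tika 1.2 or later and want accurate detection of all formats, including container formats, you need to do the following: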

 TikaConfig config = TikaConfig.getDefaultConfig();
 Detector detector = config.getDetector();

 TikaInputStream stream = TikaInputStream.get(fileOrStream);

 Metadata metadata = new Metadata();
 metadata.add(Metadata.RESOURCE_NAME_KEY, filenameWithExtension);
 MediaType mediaType = detector.detect(stream, metadata);

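If you run this with only the Core Tika jar (tika-core-1.2-....), the only detector present will be the mime magic one, and you will get the old-style detection based only on magic + glob. However, if you run it with both the Core and Parser Tika jars (plus their dependencies), or from the Tika App (which automatically includes core + parsers + dependencies), then DefaultDetector will use all the various container detectors to process your file. If your file is zip-based, detection will include processing the zip structure to identify the file type based on what is inside. This gives you the high-accuracy detection you are after, without having to call lots of different parsers in turn. DefaultDetector will use all the detectors that are available.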
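Roughly speaking, "processing the zip structure" means opening the archive and looking at well-known entries. The following is a simplified sketch using only java.util.zip, assuming Java 9+ for readAllBytes. It is not Tika's ZipContainerDetector, and the class and method names are hypothetical. It reads the ODF "mimetype" entry, or looks for the main-part content type in an OOXML "[Content_Types].xml":

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipContainerSketch {
    private static final String OOXML_PREFIX = "application/vnd.openxmlformats-officedocument.";
    private static final String[] OOXML_KINDS = {
        "wordprocessingml.document", "spreadsheetml.sheet", "presentationml.presentation"
    };

    // Assumes the bytes are already known to be a zip archive.
    static String detect(byte[] data) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals("mimetype")) {
                    // ODF packages carry their media type as plain text in this entry.
                    return new String(zip.readAllBytes(), StandardCharsets.US_ASCII).trim();
                }
                if (entry.getName().equals("[Content_Types].xml")) {
                    return ooxmlType(new String(zip.readAllBytes(), StandardCharsets.UTF_8));
                }
            }
            return "application/zip"; // a plain zip with no recognised marker entry
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Maps the content type of the main document part to the document's media type.
    private static String ooxmlType(String contentTypesXml) {
        for (String kind : OOXML_KINDS) {
            if (contentTypesXml.contains(OOXML_PREFIX + kind + ".main+xml")) {
                return OOXML_PREFIX + kind;
            }
        }
        return "application/x-tika-ooxml"; // OOXML package of an unrecognised kind
    }

    // Test helper: builds a one-entry zip archive in memory.
    static byte[] zipOf(String entryName, String content) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write(content.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] odt = zipOf("mimetype", "application/vnd.oasis.opendocument.text");
        byte[] docx = zipOf("[Content_Types].xml",
                "<Types><Override PartName=\"/word/document.xml\" ContentType=\""
                + OOXML_PREFIX + "wordprocessingml.document.main+xml\"/></Types>");
        System.out.println(detect(odt));  // application/vnd.oasis.opendocument.text
        System.out.println(detect(docx)); // application/vnd.openxmlformats-officedocument.wordprocessingml.document
    }
}
```

In real code you would let DefaultDetector do this for you, as shown above; the sketch is only meant to show why the detector needs access to the whole file rather than just its first bytes.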

score 5 · Accepted Answer

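For anyone else with a similar problem who is using a newer Tika version, this should do the trick:

Use ZipContainerDetector, since you may no longer have ContainerAwareDetector.
Pass a TikaInputStream to the detector's detect() method, to make sure Tika can work out the correct mime type.

My sample code looks like this: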

// Cached composite detector, built lazily by getDefaultDetector().
private static Detector detector;

public static String getMimeType(final Document p_document)
{
    Metadata metadata = new Metadata();
    metadata.add(Metadata.RESOURCE_NAME_KEY, p_document.getDocName());

    Detector detector = getDefaultDetector();
    LogMF.debug(log, "Trying to detect mime type with detector {0}.", detector);

    // A TikaInputStream lets the container detectors inspect the whole document.
    try (TikaInputStream inputStream = TikaInputStream.get(p_document.getData(), metadata))
    {
        return detector.detect(inputStream, metadata).toString();
    }
    catch (Throwable t)
    {
        // Pass the throwable along so the cause ends up in the log.
        log.error("Error while determining mime-type of " + p_document, t);
    }

    return null;
}

private static Detector getDefaultDetector()
{
    if (detector == null)
    {
        List<Detector> detectors = new ArrayList<>();

        // zip compressed container types
        detectors.add(new ZipContainerDetector());
        // Microsoft stuff
        detectors.add(new POIFSContainerDetector());
        // mime magic detection as fallback
        detectors.add(MimeTypes.getDefaultMimeTypes());

        detector = new CompositeDetector(detectors);
    }

    return detector;
}

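Note that the Document class is part of my domain model, so you will surely have something similar of your own in its place.

I hope someone can make use of this.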

score 2 · Accepted Answer

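The default byte-pattern detection rules in tika-core can only detect the generic OLE2 or ZIP formats used by all MS Office document types. You want to use ContainerAwareDetector for this kind of detection, afaik, with the MimeTypes detector as its fallback detector. Try this: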

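public MediaType getContentType(InputStream is, String fileName) {
    Metadata md = new Metadata();
    md.set(Metadata.RESOURCE_NAME_KEY, fileName);
    // tikaConfig is a TikaConfig held elsewhere in the class; its mime
    // repository serves as the fallback detector.
    Detector detector = new ContainerAwareDetector(tikaConfig.getMimeRepository());

    try {
        return detector.detect(is, md);
    } catch (IOException ioe) {
        // Handle the failure however your application requires; falling back
        // to the generic binary type is one option.
        return MediaType.OCTET_STREAM;
    }
}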

That way your tests should pass.

score 2 · Accepted Answer

You can use a custom Tika configuration file:

MimeTypes mimes = MimeTypesFactory.create(Thread.currentThread()
    .getContextClassLoader().getResource("tika-custom-MimeTypes.xml"));
Metadata metadata = new Metadata();
metadata.add(Metadata.RESOURCE_NAME_KEY, file.getName());
TikaInputStream tis = TikaInputStream.get(file);
String mimetype = new DefaultDetector(mimes).detect(tis, metadata).toString();

Put "tika-custom-MimeTypes.xml", containing your changes, in WEB-INF/classes.

In my case:

<mime-type type="video/mp4">
    <magic priority="60">
      <match value="ftypmp41" type="string" offset="4"/>
      <match value="ftypmp42" type="string" offset="4"/>
      <!-- add -->
      <match value="ftyp" type="string" offset="4"/>
    </magic>
    <glob pattern="*.mp4"/>
    <glob pattern="*.mp4v"/>
    <glob pattern="*.mpg4"/>
    <!-- sub-class-of type="video/quicktime" /-->
</mime-type>
<mime-type type="video/quicktime">
    <magic priority="50">
      <match value="moov" type="string" offset="4"/>
      <match value="mdat" type="string" offset="4"/>
      <!--remove for videos of screencast -->
      <!--match value="ftyp" type="string" offset="4"/-->
    </magic>
    <glob pattern="*.qt"/>
    <glob pattern="*.mov"/>
</mime-type>
