java - Apache Tika - Incorrect MIME (content type) detection

Question

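I am trying to detect the content type of a file that is passed to a web service inside a SOAP envelope. The file can be represented in two ways:

from its URL,
from its content (base64-encoded data).

At this point I can turn the file into a stream buffer. However, every attempt to get its content type has failed. If the file extension is given, the content type is detected; otherwise the content is always detected as "plain text".

Below is the code of my class: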

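import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.UUID;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.HttpHeaders;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaMetadataKeys;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class MetadataAnalyser {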

private InputStream _is;

private File _file;

private void initializeAttributes() {

    _is = null;
    _file= null;

}


private void createTemporaryFile(byte[] pData) {

    try {
        // Create a uniquely named temp file (default ".tmp" suffix) in the project tmp directory.
        _file = File.createTempFile(
                UUID.randomUUID().toString().replace("-", ""),
                null,
                new File("C:\\Users\\Florent\\Documents\\NetBeansProjects\\ServiceEdition\\tmp"));
        _file.deleteOnExit();

        // try-with-resources closes the stream even if write() fails.
        try (FileOutputStream fos = new FileOutputStream(_file)) {
            fos.write(pData);
        }
    } catch (IOException e) {
        // If the file could not be created or written, nothing further is attempted.
        e.printStackTrace();
    }

}

public MetadataAnalyser(byte[] pData) {

    initializeAttributes();
    _is = new ByteArrayInputStream(pData);
    createTemporaryFile(pData);

}

public MetadataAnalyser(InputStream pIs) {

    initializeAttributes();
    _is = pIs;
    _file = null;

}

public MetadataAnalyser(File pFile) {

    initializeAttributes();
    try {
        _file = pFile;
        _is = new FileInputStream(_file);
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (Exception e) {
        e.printStackTrace();
    }

}

public MetadataAnalyser(String pFile) {

    initializeAttributes();
    try {
        _file = new File(pFile);
        if (_file.exists()) {
            _is = new FileInputStream(_file);
        }
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (Exception e) {
        e.printStackTrace();
    }

}

public String getContentType() {

    // No parsers are registered, so parse() is only used here for type
    // detection, not for content extraction.
    AutoDetectParser parser = new AutoDetectParser();
    parser.setParsers(new HashMap<MediaType, Parser>());

    Metadata metadata = new Metadata();
    if (_file != null) {
        // The file name is passed as an additional hint for detection.
        metadata.add(TikaMetadataKeys.RESOURCE_NAME_KEY, _file.getName());
    }

    // Read from the file when there is one; otherwise use the stream given
    // to the constructor (the stream is closed once detection has run).
    try (InputStream is = (_file != null) ? new FileInputStream(_file) : _is) {
        if (is == null) {
            return null;
        }
        parser.parse(is, new DefaultHandler(), metadata, new ParseContext());
        return metadata.get(HttpHeaders.CONTENT_TYPE);
    } catch (IOException | SAXException | TikaException e) {
        e.printStackTrace();
        return null;
    }

}

}

So how can I detect the MIME type even when the file extension is unknown?

score 0 · Accepted Answer

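I don't think you can detect the MIME type without an extension. You need to know which system writes the file and what kinds of files to expect, and set the MIME type based on that (I assume you use it in your response).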

score 0 · Accepted Answer

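You need to make sure the content is decoded before you hand it to Tika. And no, an extension is definitely not required: detection uses a well-understood MIME magic process, described here: https://tika.apache.org/1.1/detection.html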
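
To illustrate why decoding matters, here is a minimal sketch using only the JDK. It is not Tika's implementation: the class name `MagicSniffer`, its methods, and the two hard-coded signatures are assumptions made for this example, and Tika's own detector covers far more formats. The point is only that a magic-byte check finds nothing in the base64 text, but recognises the decoded bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical helper for illustration only; not part of Tika.
final class MagicSniffer {

    // Leading bytes ("magic numbers") of two well-known formats.
    private static final byte[] PNG_MAGIC =
            {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
    private static final byte[] PDF_MAGIC =
            "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Decodes the base64 text taken from the SOAP envelope into raw file bytes. */
    public static byte[] decode(String base64Payload) {
        // The MIME decoder tolerates line breaks inside the payload.
        return Base64.getMimeDecoder().decode(base64Payload);
    }

    /** Returns a MIME type based on the first bytes, or a generic fallback. */
    public static String detect(byte[] data) {
        if (startsWith(data, PNG_MAGIC)) return "image/png";
        if (startsWith(data, PDF_MAGIC)) return "application/pdf";
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        String payload = Base64.getEncoder().encodeToString(pdf);

        // Sniffing the still-encoded text finds no known signature.
        System.out.println(detect(payload.getBytes(StandardCharsets.US_ASCII)));
        // The decoded bytes start with "%PDF-" and are recognised.
        System.out.println(detect(decode(payload)));
    }
}
```

In the class from the question, the equivalent step would be to base64-decode the payload before passing the resulting `byte[]` to the `MetadataAnalyser(byte[] pData)` constructor.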
