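If you want to work with multiple file formats, Apache Tika is your best bet. Tika provides a common interface for extracting text and metadata from a large number of formats, and it hides the complexity of the different types and formats from you.
On the command line, to extract the metadata from this sample file, you would run: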
java -jar tika-app-1.4.jar --metadata quick.odt
And you would get back a large amount of metadata:
Author: Jesper Steen Møller
Character Count: 43
Content-Length: 7042
Content-Type: application/vnd.oasis.opendocument.text
Creation-Date: 2005-09-06T23:34:00
Edit-Time: PT2M0S
Image-Count: 0
Keywords: Pangram, fox, dog
Last-Modified: 2005-09-06T23:49:00
Last-Save-Date: 2005-09-06T23:49:00
Object-Count: 0
Page-Count: 1
Paragraph-Count: 1
Table-Count: 0
Word-Count: 9
cp:subject: Gym class featuring a brown fox and lazy dog
creator: Jesper Steen Møller
date: 2005-09-06T23:49:00
dc:creator: Jesper Steen Møller
dc:description: Gym class featuring a brown fox and lazy dog
dc:language: en-US
dc:subject: Pangram, fox, dog
dc:title: The quick brown fox jumps over the lazy dog
dcterms:created: 2005-09-06T23:34:00
dcterms:modified: 2005-09-06T23:49:00
description: Gym class featuring a brown fox and lazy dog
editing-cycles: 5
generator: OpenOffice.org/1.9.125$Win32 OpenOffice.org_project/680m125$Build-8947
initial-creator: Nevin Nollop
language: en-US
meta:author: Jesper Steen Møller
meta:character-count: 43
meta:creation-date: 2005-09-06T23:34:00
meta:image-count: 0
meta:initial-author: Nevin Nollop
meta:object-count: 0
meta:page-count: 1
meta:paragraph-count: 1
meta:save-date: 2005-09-06T23:49:00
meta:table-count: 0
meta:word-count: 9
modified: 2005-09-06T23:49:00
nbCharacter: 43
nbImg: 0
nbObject: 0
nbPage: 1
nbPara: 1
nbTab: 0
nbWord: 9
resourceName: quick.odt
subject: Gym class featuring a brown fox and lazy dog
title: The quick brown fox jumps over the lazy dog
xmpTPg:NPages: 1
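If you capture that output from a script rather than calling Tika from Java, it is easy to turn back into key/value pairs. Here is a minimal sketch using only the JDK, assuming the output keeps the "name: value" shape shown above. The class name TikaMetadataLines and its parse method are made up for this illustration and are not part of Tika. It splits on the first ": " so that names such as dc:creator and timestamp values, which contain a bare colon, stay intact.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Tika: parses "name: value" lines
// as printed by the command above.
public class TikaMetadataLines {

    // Splits each line on the first ": " (colon followed by a space).
    // Lines without that separator are skipped; if a name repeats,
    // the later value replaces the earlier one.
    public static Map<String, String> parse(List<String> lines) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String line : lines) {
            int sep = line.indexOf(": ");
            if (sep < 0) {
                continue; // not a "name: value" line
            }
            result.put(line.substring(0, sep), line.substring(sep + 2));
        }
        return result;
    }

    public static void main(String[] args) {
        // A few lines copied from the output above.
        List<String> sample = List.of(
            "Content-Type: application/vnd.oasis.opendocument.text",
            "dc:title: The quick brown fox jumps over the lazy dog",
            "dcterms:created: 2005-09-06T23:34:00");
        Map<String, String> meta = parse(sample);
        System.out.println(meta.get("dc:title"));
        System.out.println(meta.get("dcterms:created"));
    }
}
```

A LinkedHashMap keeps the entries in the order Tika printed them, which makes the result easier to compare against the raw output.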
From Java, you can get the same result in a fairly simple way:
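// Needs java.io.*, org.apache.tika.config.TikaConfig, org.apache.tika.io.TikaInputStream,
// org.apache.tika.metadata.Metadata, org.apache.tika.parser.ParseContext, org.xml.sax.helpers.DefaultHandler
TikaConfig tika = TikaConfig.getDefaultConfig();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
// A no-op handler discards the extracted text; a null handler can trigger a NullPointerException
try (InputStream input = TikaInputStream.get(new File("quick.odt"))) {
    tika.getParser().parse(input, new DefaultHandler(), metadata, context);
}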
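After the parse call, the metadata is available in the Metadata object: metadata.names() lists the keys and metadata.get(name) returns the value for one of them.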