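I have a number of files in different formats (HTML, PDF, DOC, EPUB). Using Apache Tika and Java, I have already extracted their metadata and stored it in MongoDB. Now I want to extract keywords or tags from each file's content and add them to one of the metadata fields. Is this possible with Apache Tika? If not, please suggest another way to do it.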

My MongoDB documents (sample):

{"Filename": "PHP Book.pdf", "Author": "John", "Description": "This is my PHP Book"}
{"Filename": "Java Book.html", "Author": "Paul", "Description": "This is my JAVA Book"}
{"Filename": ".NET Book.doc", "Author": "James", "Description": "This is my .NET Book"}

Now I want to add another field containing the content tags or keywords, so that the documents look like this (sample):

{"Filename": "PHP Book.pdf", "Author": "John", "Description": "This is my PHP Book", "keywords": ["PHP", "PDF", "BOOK"]}
{"Filename": "Java Book.html", "Author": "Paul", "Description": "This is my JAVA Book", "keywords": ["JAVA", "html", "BOOK"]}
{"Filename": ".NET Book.doc", "Author": "James", "Description": "This is my .NET Book", "keywords": [".NET", "doc"]}
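
To make the goal concrete, here is a rough sketch of the kind of step I have in mind. It assumes the plain text of each file is already available as a `String` (for example from Tika's text extraction) and simply ranks words by frequency. The class name `KeywordSketch`, the method `topKeywords`, and the tiny stop-word list are all made up for illustration; this is not a Tika API, and writing the result into the `keywords` field in MongoDB is not shown.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: naive frequency-based keyword extraction.
final class KeywordSketch {

    // Assumption: a very small English stop-word list, for illustration only.
    private static final Set<String> STOP_WORDS = Set.of(
            "a", "an", "and", "the", "is", "this", "my", "of", "to", "in", "for", "on", "with");

    // Returns up to `limit` lowercase words, most frequent first (ties broken alphabetically).
    static List<String> topKeywords(String text, int limit) {
        Map<String, Integer> counts = new HashMap<>();
        // Split on anything that is not a letter or digit, so ".NET" becomes "net".
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.length() < 2 || STOP_WORDS.contains(token)) {
                continue; // skip empty tokens, single characters, and stop words
            }
            counts.merge(token, 1, Integer::sum);
        }

        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : entries) {
            if (result.size() == limit) {
                break;
            }
            result.add(entry.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for text extracted from a file.
        String content = "PHP is great. This PHP book covers PHP and the book is short";
        System.out.println(topKeywords(content, 2));
    }
}
```

Plain word counts like this would miss multi-word phrases and tokens such as ".NET", so I suspect a proper solution needs something more capable, which is why I am asking.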

Thanks.
