I would like to write a Java web crawler that uses Apache Tika to download the textual content of web pages, but I'm new to Apache projects and haven't found a definitive source that explains exactly how to integrate Tika into a program. From what I've gathered online, I have built Tika with Maven on the command line, but I'm not sure where to go from here to use Tika classes such as Parser in my Java programs. I'm using Eclipse, if that makes a difference. I've also installed the Maven plugin for Eclipse, but I'm not sure what to do with it. Do I need to add an "import..." line? Please excuse my beginner questions, but a step-by-step guide to preparing Tika for use would be appreciated.

1 Answer

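First, you'll need to read through the Apache Tika Getting Started guide, which covers how to include Tika in your project. (This assumes you have some basic knowledge of including third-party jars in your own project; if not, you'll need to read some tutorials on that first.)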

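The easiest way to get started with Tika in your project is through the Tika facade class. It gives you a single class you can use for detection, for parsing to a plain-text string, and for parsing to XHTML through a Reader, all from a variety of sources. All the basics are there.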
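As a rough illustration of the facade, here is a minimal sketch. It assumes the Tika jars (tika-core plus a parsers artifact such as tika-parsers) are already on your classpath, for example as Maven dependencies. The class name `TikaFacadeExample` and the sample HTML are made up for this example; `Tika.detect` and `Tika.parseToString` are the facade methods being shown.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

// Hypothetical example class; only the org.apache.tika types come from Tika itself.
class TikaFacadeExample {

    // Guess the media type from a file name alone (no content is read).
    static String detectType(String fileName) {
        return new Tika().detect(fileName);
    }

    // Extract the plain text of a document; Tika picks a parser automatically.
    // Assumes a parsers artifact is on the classpath, otherwise no text is extracted.
    static String extractText(InputStream in) throws IOException, TikaException {
        return new Tika().parseToString(in);
    }

    public static void main(String[] args) throws Exception {
        // In-memory sample page, standing in for a downloaded web page.
        String html = "<html><body><p>Hello Tika</p></body></html>";
        InputStream in = new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));

        System.out.println(detectType("page.html"));
        System.out.println(extractText(in));
    }
}
```

So yes, you do need `import` lines, but only once the jars are on the build path. For a crawler, the facade also has a `parseToString(URL)` overload that fetches and parses in one call. Note that `parseToString` truncates very long documents by default.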

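For more advanced uses, you'll want to follow the information given on the Parser API page and the Content Detection page. You can also follow the Tika example for parsing with AutoDetectParser, which should do what you're likely to want; otherwise, browse the annotated list of Tika examples, with their explanations, to see how to get started!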
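For the AutoDetectParser route, something along these lines should work. This is a sketch and not the official Tika example: the class name `AutoDetectExample` and the sample HTML are invented here, and it again assumes tika-core and a parsers artifact are on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical example class showing the lower-level Parser API.
class AutoDetectExample {

    // Parse any supported document and return its body text.
    // Metadata found during parsing (content type etc.) is written into 'metadata'.
    static String parse(InputStream in, Metadata metadata)
            throws IOException, SAXException, TikaException {
        AutoDetectParser parser = new AutoDetectParser();
        // -1 disables the default write limit on extracted characters.
        BodyContentHandler handler = new BodyContentHandler(-1);
        parser.parse(in, handler, metadata, new ParseContext());
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // In-memory sample page, standing in for a downloaded web page.
        String html = "<html><head><title>Demo</title></head>"
                + "<body><p>Hello Tika</p></body></html>";
        Metadata metadata = new Metadata();
        try (InputStream in =
                new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8))) {
            System.out.println(parse(in, metadata));
        }
        // List whatever metadata keys the parser populated.
        for (String name : metadata.names()) {
            System.out.println(name + " = " + metadata.get(name));
        }
    }
}
```

Compared with the facade, this gives you control over the content handler and access to the metadata, at the cost of a little more setup.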

Answered 2013-07-24T08:35:27.160