
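Dear StackOverflow developers, I need your help. I am stuck with Apache Lucene and cannot get it working in a Java Swing application. The problem is complicated enough that I am not even sure how to ask it, so please try to understand my actual requirement. The situation is simple: I have to provide HTML files so that the client can access them in a Swing application, and for the search facility I decided to use Apache Lucene indexing. That gives me search, but now I want to display the contents of the HTML files that match the search criteria. On the Java side I am using Swing, and JEditorPane is the control in which I have to display the HTML file content. Please suggest how I should index the HTML files and how I can get an HTML file's content back from the Lucene index. The HTML files contain not only text but also links,

Thanks in advance; I hope you can help.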


1 Answer


In one of our projects, where we used Lucene for full-text indexing and search, we handled HTML files as follows:

  • Stored each HTML document as is on disk (you can store it in the DB as well).
  • Extracted the text, links, etc. from the HTML documents using the HTML-to-text converter in the Jericho HTML Parser.
  • Gave each Lucene document fields holding metadata about the HTML file, alongside the tokenized text content of the HTML.
  • Used StandardAnalyzer so that certain tokens, such as email addresses and website links, were kept as is during tokenization before indexing.
  • On searching the index, the hits returned contained the metadata of the HTML files that matched the criteria, so we could identify the HTML content to display for a given search result.
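
To make the flow concrete, here is a rough sketch of the pattern: index the extracted text, keep the file path as metadata, and load the original file into a JEditorPane for each hit. It is not our project code. To stay self-contained it uses an in-memory map as a stand-in for the Lucene index and a crude regex as a stand-in for Jericho's text extraction, and every name in it (`HtmlSearchSketch`, `extractText`, `index`, `search`, `newViewer`, `show`) is hypothetical. Only the `JEditorPane` calls are the real Swing API.

```java
import java.io.IOException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import javax.swing.JEditorPane;
import javax.swing.event.HyperlinkEvent;

/** Stand-in for the index: maps each lowercase token to the paths of the HTML files containing it. */
final class HtmlSearchSketch {
    private final Map<String, Set<String>> tokenToPaths = new TreeMap<>();

    /** Crude tag stripper (assumption: good enough for a demo; a real HTML parser does this properly). */
    static String extractText(String html) {
        return html.replaceAll("(?s)<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    /** Index only the extracted text; the path is the metadata that comes back with each hit. */
    void index(String path, String html) {
        for (String token : extractText(html).toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                tokenToPaths.computeIfAbsent(token, k -> new LinkedHashSet<>()).add(path);
            }
        }
    }

    /** Return the paths of the files containing the given single term, in indexing order. */
    List<String> search(String term) {
        Set<String> hits = tokenToPaths.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? new ArrayList<>() : new ArrayList<>(hits);
    }

    /** Create a read-only pane that follows links when they are clicked. */
    static JEditorPane newViewer() {
        JEditorPane pane = new JEditorPane();
        pane.setEditable(false); // hyperlink events are only generated for non-editable panes
        pane.addHyperlinkListener(e -> {
            if (e.getEventType() == HyperlinkEvent.EventType.ACTIVATED && e.getURL() != null) {
                try {
                    pane.setPage(e.getURL());
                } catch (IOException ex) {
                    ex.printStackTrace();
                }
            }
        });
        return pane;
    }

    /** Load the original HTML file (not the indexed text) so markup and links survive. */
    static void show(JEditorPane pane, String path) throws IOException {
        pane.setPage(Paths.get(path).toUri().toURL());
    }
}
```

With a real Lucene index, the map's role would presumably be played by a stored field holding the path, read back from each hit and passed to something like `show`. The point is that the index only needs to find the file; the display always comes from the HTML stored on disk.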

HTH.

answered 2012-10-04T04:43:53.107