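I am working on a text-mining project that I plan to integrate with Lucene later. My current implementation uses OpenNLP for common NLP tasks such as tokenization and building n-gram features. I would like to know whether Lucene supports these features. Compared with OpenNLP, can Lucene process large document collections efficiently?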

1 Answer

  1. Lucene provides tokenization and n-gram analysis.
  2. If your Lucene documents have one or more categories, then you can implement a Hyperpipes classifier by counting how many of your hits fall into each category, then assigning the category with the most hits to your query. (I'm sure there are other classifiers you could implement; Hyperpipes just happened to come to mind because it falls out naturally from using a search engine as the backend.)
  3. Since Lucene is a library, you can use it from a GUI, a command-line program, or a service (daemon).
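
The vote counting described in item 2 can be sketched in plain Java. This is only an illustration under stated assumptions: `CategoryVote` and `majorityCategory` are hypothetical names, and the list of category labels stands in for whatever category field you would read from each Lucene hit. No Lucene calls are shown, and ties are broken in favor of the category seen first, which is an arbitrary choice.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: picks the most frequent category among search hits.
class CategoryVote {

    // hitCategories holds one category label per hit (assumed to be read
    // from a stored field of each matching document).
    static Optional<String> majorityCategory(List<String> hitCategories) {
        // LinkedHashMap keeps first-seen order, so ties go to the earliest category.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String category : hitCategories) {
            counts.merge(category, 1, Integer::sum);
        }

        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        // Empty when there were no hits at all.
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        List<String> hits = List.of("sports", "politics", "sports", "tech", "sports");
        System.out.println(majorityCategory(hits).orElse("(no hits)")); // prints "sports"
    }
}
```

In a real application you would probably limit the vote to the top-ranked hits, and you might weight each vote by the hit's score rather than counting every hit equally.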
Answered 2012-12-17T22:20:31.007