我正在研究Java中的Lucene算法。我们在MySQL数据库中有 100K 停止名称。停止名称就像
NEW YORK PENN STATION,
NEWARK PENN STATION,
NEWARK BROAD ST,
NEW PROVIDENCE
etc
当用户提供像NEW YORK这样的搜索输入时,我们会在结果中得到NEW YORK PENN STATION stop,但是当用户在搜索输入中提供确切的NEW YORK PENN STATION时,它会返回零个结果。
我的代码是 -
public ArrayList<String> getSimilarString(ArrayList<String> source, String querystr)
{
ArrayList<String> arResult = new ArrayList<String>();
try
{
// 0. Specify the analyzer for tokenizing text.
// The same analyzer should be used for indexing and searching
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
// 1. create the index
Directory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
IndexWriter w = new IndexWriter(index, config);
for(int i = 0; i < source.size(); i++)
{
addDoc(w, source.get(i), "1933988" + (i + 1) + "z");
}
w.close();
// 2. query
// the "title" arg specifies the default field to use
// when no field is explicitly specified in the query.
Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr + "*");
// 3. search
int hitsPerPage = 20;
IndexReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// 4. Get results
for(int i = 0; i < hits.length; ++i)
{
int docId = hits[i].doc;
Document d = searcher.doc(docId);
arResult.add(d.get("title"));
}
// reader can only be closed when there
// is no need to access the documents any more.
reader.close();
}
catch(Exception e)
{
System.out.println("Exception (LuceneAlgo.getSimilarString()) : " + e);
}
return arResult;
}
private static void addDoc(IndexWriter w, String title, String isbn) throws IOException
{
Document doc = new Document();
doc.add(new TextField("title", title, Field.Store.YES));
// use a string field for isbn because we don't want it tokenized
doc.add(new StringField("isbn", isbn, Field.Store.YES));
w.addDocument(doc);
}
在此代码源中是停止名称列表,查询是用户给定的搜索输入。
Lucene 算法是否适用于大字符串?
为什么 Lucene 算法不适用于精确字符串?