java - Why does the Lucene algorithm not work for exact strings in Java?

Question

I am working with the Lucene algorithm in Java. We have 100K stop names in a MySQL database. The stop names look like

NEW YORK PENN STATION, 
NEWARK PENN STATION,
NEWARK BROAD ST,
NEW PROVIDENCE
etc

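When the user provides a search input like NEW YORK, we get the NEW YORK PENN STATION stop in the results, but when the user provides the exact NEW YORK PENN STATION as the search input, it returns zero results.

My code is -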

public ArrayList<String> getSimilarString(ArrayList<String> source, String querystr)
  {
      ArrayList<String> arResult = new ArrayList<String>();

        try
        {
            // 0. Specify the analyzer for tokenizing text.
            //    The same analyzer should be used for indexing and searching
            StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);

            // 1. create the index
            Directory index = new RAMDirectory();

            IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);

            IndexWriter w = new IndexWriter(index, config);

            for(int i = 0; i < source.size(); i++)
            {
                addDoc(w, source.get(i), "1933988" + (i + 1) + "z");
            }

            w.close();

            // 2. query
            // the "title" arg specifies the default field to use
            // when no field is explicitly specified in the query.
            Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr + "*");

            // 3. search
            int hitsPerPage = 20;
            IndexReader reader = DirectoryReader.open(index);
            IndexSearcher searcher = new IndexSearcher(reader);
            TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
            searcher.search(q, collector);
            ScoreDoc[] hits = collector.topDocs().scoreDocs;

            // 4. Get results
            for(int i = 0; i < hits.length; ++i) 
            {
                  int docId = hits[i].doc;
                  Document d = searcher.doc(docId);
                  arResult.add(d.get("title"));
            }

            // reader can only be closed when there
            // is no need to access the documents any more.
            reader.close();

        }
        catch(Exception e)
        {
            System.out.println("Exception (LuceneAlgo.getSimilarString()) : " + e);
        }

        return arResult;

  }

  private static void addDoc(IndexWriter w, String title, String isbn) throws IOException 
  {
        Document doc = new Document();
        doc.add(new TextField("title", title, Field.Store.YES));

        // use a string field for isbn because we don't want it tokenized
        doc.add(new StringField("isbn", isbn, Field.Store.YES));
        w.addDocument(doc);
  }

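In this code, source is the list of stop names and querystr is the search input given by the user.

Does the Lucene algorithm work for large strings?

Why does the Lucene algorithm not work for exact strings?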

score 2 · Accepted Answer

Instead of

1) Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr + "*");

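E.g. "new york station" will be parsed as "title:new title:york title:station*" (the appended "*" turns the last term into a prefix query). This query returns all documents containing any of the above terms.

Try this: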

2) Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse("+(" + querystr + ")");

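Ex1: "new york" will be parsed as "+(title:new title:york)".

The "+" above means the clause must occur in the result document. It will match documents containing "new york" as well as "new york station".

Ex2: "new york station" will be parsed as "+(title:new title:york title:station)". The intent is for this query to match only "new york station" and not just "new york", because station is missing there. Note, however, that the "+" applies to the parenthesized group as a whole: with QueryParser's default OR operator, a document containing any one of the terms still satisfies the group. To require every term, prefix each term with "+" or set the parser's default operator to AND.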
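The difference between "any term may match" and "every term must match" can be illustrated with a small model that uses only the Java standard library. This is a sketch, not Lucene code: the class name BooleanMatchSketch, its methods, and the lowercase-and-split tokenizer are all made up for illustration and only approximate what an analyzer does.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical model of OR vs. AND term matching; not part of Lucene.
class BooleanMatchSketch {

    // Simplified stand-in for an analyzer: lowercase, split on non-alphanumerics.
    static Set<String> terms(String text) {
        Set<String> out = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // OR semantics: one shared term is enough.
    static boolean matchesAny(String doc, String query) {
        Set<String> docTerms = terms(doc);
        for (String q : terms(query)) {
            if (docTerms.contains(q)) {
                return true;
            }
        }
        return false;
    }

    // AND semantics: every query term must be present in the document.
    static boolean matchesAll(String doc, String query) {
        return terms(doc).containsAll(terms(query));
    }

    public static void main(String[] args) {
        String query = "new york station";
        for (String stop : new String[] {
                "NEW YORK PENN STATION", "NEWARK PENN STATION", "NEW PROVIDENCE"}) {
            System.out.println(stop + " any=" + matchesAny(stop, query)
                    + " all=" + matchesAll(stop, query));
        }
    }
}
```

Under OR semantics NEWARK PENN STATION is a hit for "new york station" because it shares the term station; under AND semantics only NEW YORK PENN STATION remains.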

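Please make sure the field name "title" is the one you want to search.

Your questions.

Does the Lucene algorithm work for large strings?

You have to define what a large string is. Are you actually looking for a phrase search? In general, yes, Lucene works for large strings.

Why does the Lucene algorithm not work for exact strings?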

Because parsing (querystr + "*") generates individual term queries and joins them with the OR operator. E.g. 'new york*' will be parsed as "title:new OR title:york*".

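If you are expecting to find "new york station", the wildcard query above is not what you should be looking for. This is because the StandardAnalyzer you passed in tokenizes (breaks into terms) "new york station" into 3 terms at index time.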
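To make the tokenization step concrete, here is a rough approximation written with the Java standard library only. It is an assumption-laden sketch: the class TokenizeSketch is hypothetical, and the real StandardAnalyzer does more (Unicode-aware tokenization and, in Lucene 4.0, stop-word removal). It is only meant to show that one stop name becomes several independent terms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical, simplified approximation of analysis; not Lucene's implementation.
class TokenizeSketch {

    // Lowercase the text and split it on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // One field value, several separately indexed terms.
        System.out.println(tokenize("new york station"));
        System.out.println(tokenize("NEW YORK PENN STATION,"));
    }
}
```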

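So the query "york*" will find "york station" only because it contains "york", not because of the wildcard. "york" knows nothing about "station": they are different terms, i.e. different entries in the index.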
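The idea that each term is its own index entry, and that a trailing wildcard only expands within a single term, can be sketched with a toy inverted index. Everything here (the class InvertedIndexSketch, the add and prefix methods, the integer document ids) is hypothetical and uses only the standard library; it is a model of the concept, not of Lucene's internals.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical toy inverted index: term -> ids of documents containing that term.
class InvertedIndexSketch {

    private final TreeMap<String, TreeSet<Integer>> postings = new TreeMap<>();

    // Index a document: every token becomes a separate entry.
    void add(int docId, String text) {
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                postings.computeIfAbsent(t, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Models a query like "york*": collect documents for every term starting with the prefix.
    Set<Integer> prefix(String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        Set<Integer> hits = new TreeSet<>();
        for (Map.Entry<String, TreeSet<Integer>> e : postings.tailMap(p, true).entrySet()) {
            if (!e.getKey().startsWith(p)) {
                break; // sorted order: no later term can share the prefix
            }
            hits.addAll(e.getValue());
        }
        return hits;
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch();
        idx.add(0, "NEW YORK PENN STATION");
        idx.add(1, "NEWARK PENN STATION");
        idx.add(2, "NEWARK BROAD ST");
        idx.add(3, "NEW PROVIDENCE");
        System.out.println(idx.prefix("york"));         // matches via the term "york"
        System.out.println(idx.prefix("new"));          // expands to "new" and "newark"
        System.out.println(idx.prefix("york station")); // no single term starts with this
    }
}
```

In this model a prefix that spans two words never matches, because no single indexed term contains a space.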

What you actually need is a PhraseQuery to find the exact string, for which the query string should be "new york" WITH the quotes.
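A phrase match requires the query terms to appear next to each other and in order. The sketch below models that with the standard library and also shows one way to wrap raw user input in quotes before handing it to the parser. The class PhraseSketch and its methods are hypothetical names; the matching logic is a simplified model, not Lucene's PhraseQuery implementation, and the escaping covers only backslashes and double quotes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical model of phrase matching; not part of Lucene.
class PhraseSketch {

    // Simplified stand-in for an analyzer: lowercase, split on non-alphanumerics.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // True if the phrase tokens occur in the document consecutively and in order.
    static boolean matchesPhrase(String doc, String phrase) {
        List<String> phraseTokens = tokenize(phrase);
        if (phraseTokens.isEmpty()) {
            return false;
        }
        return Collections.indexOfSubList(tokenize(doc), phraseTokens) >= 0;
    }

    // Wrap user input in double quotes, escaping backslashes and embedded quotes.
    static String toPhraseQueryString(String userInput) {
        String escaped = userInput.replace("\\", "\\\\").replace("\"", "\\\"");
        return "\"" + escaped + "\"";
    }

    public static void main(String[] args) {
        System.out.println(matchesPhrase("NEW YORK PENN STATION", "new york penn station"));
        System.out.println(matchesPhrase("NEWARK PENN STATION", "new york penn station"));
        System.out.println(toPhraseQueryString("NEW YORK PENN STATION"));
    }
}
```

In the question's code, the quoted string would presumably be passed to the parse call in place of querystr + "*". A quoted phrase matches whole terms only, so it does not also behave as a prefix search.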
