
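I have an application that requires me to index several GB of sentences (about 16 million lines).

Currently my search works as follows.

My search terms usually revolve around a phrase, for example "running in the park". I want to be able to find sentences that are similar to it or that contain part of the phrase. I do this by constructing smaller phrases:

"running in the park", etc.

Each of them has a weight (the longer the phrase, the larger the weight).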
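
To make the sub-phrase idea concrete, here is a minimal plain-Java sketch. It assumes the sub-phrases are all contiguous word runs of the search phrase and that the weight is simply the number of words; the question only says that longer phrases weigh more, so both the class name `SubPhrases` and the exact weighting are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

final class SubPhrases {
    /**
     * Returns every contiguous run of at least minWords words from the phrase,
     * mapped to its weight. Assumption: weight = number of words in the run.
     */
    static Map<String, Integer> weighted(String phrase, int minWords) {
        String[] words = phrase.trim().split("\\s+");
        Map<String, Integer> result = new LinkedHashMap<>();
        // Longest runs first, so the heaviest sub-phrases come first in iteration order.
        for (int len = words.length; len >= Math.max(1, minWords); len--) {
            for (int start = 0; start + len <= words.length; start++) {
                String sub = String.join(" ", Arrays.copyOfRange(words, start, start + len));
                result.put(sub, len);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Print each sub-phrase of at least two words with its weight.
        weighted("running in the park", 2)
                .forEach((sub, weight) -> System.out.println(weight + "  " + sub));
    }
}
```

Each entry could then be turned into one weighted clause of the search query.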

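Currently I treat each line as a document. A typical search takes a few seconds, and I would like to know whether there is a way to improve the search speed.

On top of that, I also need to get what was matched.

For example, "I was jogging in the park this morning" matches "in the park", and I want to know how it matched. I am aware of the Explainer for Lucene searches, but is there an easier way, or is there a resource where I can learn how to extract the information I want from Lucene's Explainer?

I am currently using regular expressions to get the matches. That is fast but inaccurate, because Lucene sometimes ignores punctuation and other things, and I cannot handle all the special cases.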


3 Answers


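The Highlighter is better than the Explainer; it is faster. After highlighting, you can extract the matched phrases from between the tags. See: Java regex to extract text between tags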

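// Imports for Lucene 3.6 (matching Version.LUCENE_36 used below) and JUnit 4.
import static org.junit.Assert.assertTrue;

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.Before;
import org.junit.Test;

public class HighlightDemo {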
Directory directory;
Analyzer analyzer;
String[] contents = {"running in the park",
        "I was jogging in the park this morning",
        "running on the road",
        "The famous New York Marathon has its final miles in Central park every year and it's easy to understand why: the park, with a variety of terrain and excellent scenery, is the ultimate runner's dream. With its many paths that range in level of difficulty, Central Park allows a runner to experience clarity and freedom in this picturesque urban oasis."};


@Before
public void setUp() throws IOException {


    directory = new RAMDirectory();
    analyzer = new WhitespaceAnalyzer();

    // indexed documents


    IndexWriter writer = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
    for (int i = 0; i < contents.length; i++) {
        Document doc = new Document();
        doc.add(new Field("content", contents[i], Field.Store.NO, Field.Index.ANALYZED)); // index only, not stored
        doc.add(new NumericField("id", Field.Store.YES, true).setIntValue(i));      // store & index
        writer.addDocument(doc);
    }
    writer.close();
}

@Test
public void test() throws IOException, ParseException, InvalidTokenOffsetsException {
    IndexSearcher s = new IndexSearcher(directory);
    QueryParser parser = new QueryParser(Version.LUCENE_36, "content", analyzer);
    org.apache.lucene.search.Query query = parser.parse("park");

    TopDocs hits = s.search(query, 10);
    SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter();
    Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(query));
    for (int i = 0; i < hits.scoreDocs.length; i++) {
        int id = hits.scoreDocs[i].doc;
        Document doc = s.doc(id);
        // "content" is not stored, so look up the original text via the stored "id" field
        String text = contents[Integer.parseInt(doc.get("id"))];

        TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(text));
        org.apache.lucene.search.highlight.TextFragment[] frag = highlighter.getBestTextFragments(tokenStream, text, false, 10);
        for (int j = 0; j < frag.length; j++) {
            if ((frag[j] != null) && (frag[j].getScore() > 0)) {
                assertTrue(frag[j].toString().contains("<B>"));
                assertTrue(frag[j].toString().contains("</B>"));

                System.out.println(frag[j].toString());
            }
        }

    }

}
}
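
The extraction step mentioned at the top of this answer (pulling the matched text out from between the highlight tags) could look roughly like the sketch below. It uses only `java.util.regex` and assumes the `<B>`/`</B>` tags that the test above asserts on; the class name `HighlightExtractor` is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HighlightExtractor {
    // Reluctant (.*?) so that each tagged hit is captured separately.
    private static final Pattern TAGGED = Pattern.compile("<B>(.*?)</B>", Pattern.DOTALL);

    /** Returns the text of every <B>...</B> region in the fragment, in order. */
    static List<String> matchedTerms(String fragment) {
        List<String> terms = new ArrayList<>();
        Matcher m = TAGGED.matcher(fragment);
        while (m.find()) {
            terms.add(m.group(1));
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [park]
        System.out.println(matchedTerms("I was jogging in the <B>park</B> this morning"));
    }
}
```

If the highlighter tags each token on its own, adjacent hits separated only by whitespace would still have to be merged to recover a whole phrase such as "in the park".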
answered 2013-01-08T08:49:23.600

The Highlighter in Lucene's "contrib" module will let you extract what Lucene matched.

answered 2012-06-04T12:14:05.983

SpanQueries may help you find where in the sentence the query matched: https://lucene.apache.org/core/6_2_0/core/org/apache/lucene/search/spans/package-summary.html

With them you can get the exact positions from the query: How to get the matching spans of a Span Term Query in Lucene 5?
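
To illustrate what "positions" means here, the following plain-Java sketch finds the token positions at which a phrase occurs in a sentence. It is not the Lucene spans API: the class `TokenSpans`, its tokenizer (lowercase, punctuation dropped), and the end-exclusive position convention are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class TokenSpans {
    /** One match: start token position (inclusive) and end token position (exclusive). */
    static final class Span {
        final int start;
        final int end;
        Span(int start, int end) { this.start = start; this.end = end; }
        @Override public String toString() { return "[" + start + "," + end + ")"; }
    }

    /** Lowercases the text and splits it on anything that is not a letter or digit. */
    static String[] tokenize(String text) {
        String cleaned = text.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\p{N}]+", " ")
                .trim();
        return cleaned.isEmpty() ? new String[0] : cleaned.split(" ");
    }

    /** Returns every position at which all tokens of the phrase occur consecutively. */
    static List<Span> find(String sentence, String phrase) {
        String[] s = tokenize(sentence);
        String[] p = tokenize(phrase);
        List<Span> spans = new ArrayList<>();
        if (p.length == 0) {
            return spans;
        }
        for (int i = 0; i + p.length <= s.length; i++) {
            boolean matches = true;
            for (int j = 0; j < p.length && matches; j++) {
                matches = s[i + j].equals(p[j]);
            }
            if (matches) {
                spans.add(new Span(i, i + p.length));
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        System.out.println(find("I was jogging in the park this morning", "in the park"));
    }
}
```

Because matching happens on tokens rather than raw characters, punctuation such as the comma in "In the park, we ran." does not get in the way, which is the weakness of the regex approach described in the question.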

answered 2017-02-15T12:13:49.163