
Maybe this question is a bit strange... but I'll try to ask it anyway.

Everyone who has written applications with the Lucene API has seen something like this:

public static String removeStopWordsAndGetNorm(String text, String[] stopWords, Normalizer normalizer) throws IOException
{
    TokenStream tokenStream = new ClassicTokenizer(Version.LUCENE_44, new StringReader(text));
    tokenStream = new StopFilter(Version.LUCENE_44, tokenStream, StopFilter.makeStopSet(Version.LUCENE_44, stopWords, true));
    tokenStream = new LowerCaseFilter(Version.LUCENE_44, tokenStream);
    tokenStream = new StandardFilter(Version.LUCENE_44, tokenStream);
    // register the attribute once, before reset(); it is updated on each incrementToken()
    CharTermAttribute token = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset();
    StringBuilder result = new StringBuilder();
    while (tokenStream.incrementToken())
    {
        try
        {
            // normalizer.getNormalForm(...) - stemmer or lemmatizer
            result.append(normalizer.getNormalForm(token.toString())).append(" ");
        }
        catch (Exception e)
        {
            // skip tokens the normalizer cannot handle
        }
    }
    tokenStream.end();
    tokenStream.close();
    return result.toString();
}

Is it possible to rewrite this word normalization using RDDs? Maybe someone has an example of such a transformation, or can point to a web resource about it?

Thank you.


1 Answer


I recently used a similar example in a talk. It shows how to remove stop words. It has no normalization stage, but if normalizer.getNormalForm comes from a reusable library, it should be easy to integrate.

This code could be a starting point:

// source text
val rdd = sc.textFile(...)  
// stop words src
val stopWordsRdd = sc.textFile(...) 
// bring stop words to the driver to broadcast => more efficient than rdd.subtract(stopWordsRdd)
val stopWords = stopWordsRdd.collect.toSet
val stopWordsBroadcast = sc.broadcast(stopWords)
// drop the empty strings that split("\\W") produces around consecutive separators
val words = rdd.flatMap(line => line.split("\\W").map(_.toLowerCase)).filter(_.nonEmpty)
val cleaned = words.mapPartitions { iterator =>
    val stopWordsSet = stopWordsBroadcast.value
    iterator.filter(elem => !stopWordsSet.contains(elem))
}
// plug the normalizer function here
val normalized = cleaned.map(normalForm(_)) 
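One practical detail when plugging in the normalizer: any function used inside an RDD transformation is shipped to the executors, so it must be serializable. A minimal sketch in Java (the question's language), using a hypothetical suffix-stripping normalizer as a stand-in for the real stemmer or lemmatizer behind getNormalForm:

```java
import java.io.Serializable;

// Hypothetical normalizer: implements Serializable so that Spark can
// ship it inside a closure to the executors. The stripping rules are a
// naive placeholder, not a real stemmer.
public class SuffixNormalizer implements Serializable {

    public String getNormalForm(String token) {
        // very naive English suffix stripping, for illustration only
        if (token.endsWith("ing") && token.length() > 5) {
            return token.substring(0, token.length() - 3);
        }
        if (token.endsWith("s") && token.length() > 3) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    public static void main(String[] args) {
        SuffixNormalizer normalizer = new SuffixNormalizer();
        // in a Spark job this call would sit inside cleaned.map(...)
        System.out.println(normalizer.getNormalForm("tokens"));
        System.out.println(normalizer.getNormalForm("word"));
    }
}
```

With Spark's Java API the same idea would be expressed as cleaned.map(normalizer::getNormalForm); the key point is only the Serializable marker, not the stripping logic.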

Note: this is written from the perspective of a Spark job. I am not familiar with Lucene.

Answered 2014-11-15T09:42:29.390