scala - Simplest method for text lemmatization in Scala and Spark

Question

I want to use lemmatization on a text file:

surprise heard thump opened door small seedy man clasping package wrapped.

upgrading system found review spring 2008 issue moody audio backed.

omg left gotta wrap review order asap . understand hand delivered dali lama

speak hands wear earplugs lives . listen maintain link long .

cables cables finally able hear gem long rumored music .
...

The expected output is:

surprise heard thump open door small seed man clasp package wrap.

upgrade system found review spring 2008 issue mood audio back.

omg left gotta wrap review order asap . understand hand deliver dali lama

speak hand wear earplug live . listen maintain link long .

cable cable final able hear gem long rumor music .
...

Can anyone help me? Does anyone know the simplest method for lemmatization implemented in Scala and Spark?

score 7 · Accepted Answer

The book Advanced Analytics with Spark has a function for this in its chapter on lemmatization:

  import java.util.Properties

  import edu.stanford.nlp.ling.CoreAnnotations._
  import edu.stanford.nlp.pipeline._
  import scala.collection.JavaConversions._
  import scala.collection.mutable.ArrayBuffer

  val plainText = sc.parallelize(List("Sentence to be processed."))

  val stopWords = Set("stopWord")

  def plainTextToLemmas(text: String, stopWords: Set[String]): Seq[String] = {
    // Builds a new pipeline on every call, which is simple but costly.
    val props = new Properties()
    props.put("annotators", "tokenize, ssplit, pos, lemma")
    val pipeline = new StanfordCoreNLP(props)
    val doc = new Annotation(text)
    pipeline.annotate(doc)
    val lemmas = new ArrayBuffer[String]()
    val sentences = doc.get(classOf[SentencesAnnotation])
    for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
      val lemma = token.get(classOf[LemmaAnnotation])
      // Keep lemmas longer than two characters that are not stop words.
      if (lemma.length > 2 && !stopWords.contains(lemma)) {
        lemmas += lemma.toLowerCase
      }
    }
    lemmas
  }

  val lemmatized = plainText.map(plainTextToLemmas(_, stopWords))
  lemmatized.foreach(println)

Now just apply it to every line in a mapper:

val lemmatized = plainText.map(plainTextToLemmas(_, stopWords))

Edit:

I added this line to the code:

import scala.collection.JavaConversions._

This is required because otherwise sentences is a Java list rather than a Scala list. The code should now compile without problems.
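As a side note, scala.collection.JavaConversions was deprecated in Scala 2.12 and removed in 2.13. The sketch below shows the explicit alternative, scala.collection.JavaConverters with .asScala, which is also available in Scala 2.10 (on 2.13 it still works but is itself deprecated in favor of scala.jdk.CollectionConverters). The object name JavaListIteration and the helper javaTokens are hypothetical stand-ins for a Java API that returns a java.util.List, as doc.get(...) does in CoreNLP; this is not part of the original answer.

```scala
import java.util.{ArrayList => JArrayList, List => JList}
import scala.collection.JavaConverters._

object JavaListIteration {
  // Hypothetical stand-in for a Java API that returns a java.util.List.
  def javaTokens(): JList[String] = {
    val list = new JArrayList[String]()
    list.add("cables")
    list.add("finally")
    list.add("able")
    list
  }

  // .asScala wraps the Java list so it can be used in a for-comprehension.
  def upperCased(): Seq[String] =
    (for (token <- javaTokens().asScala) yield token.toUpperCase).toList

  def main(args: Array[String]): Unit =
    println(upperCased().mkString(" ")) // prints "CABLES FINALLY ABLE"
}
```

With this import, the loop in plainTextToLemmas would iterate over sentences.asScala and sentence.get(classOf[TokensAnnotation]).asScala instead of relying on an implicit conversion.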

I used Scala 2.10.4 and the following stanford.nlp dependencies:

<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>3.5.2</version>
</dependency>
<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>3.5.2</version>
  <classifier>models</classifier>
</dependency>
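For sbt projects, the same two artifacts could be declared roughly as follows. This is a sketch that reuses the coordinates and version from the Maven snippet above; the rest of the build file is not shown, and the sbt form does not come from the original answer.

```scala
// build.sbt (sketch): same group, artifact, and version as the Maven snippet
libraryDependencies ++= Seq(
  "edu.stanford.nlp" % "stanford-corenlp" % "3.5.2",
  // The models are published under the "models" classifier.
  "edu.stanford.nlp" % "stanford-corenlp" % "3.5.2" classifier "models"
)
```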

You can also look at the stanford.nlp page, which has many examples (in Java): http://nlp.stanford.edu/software/corenlp.shtml.

Edit:

mapPartitions version:

I don't know whether it will speed up the job significantly, though.

  def plainTextToLemmas(text: String, stopWords: Set[String], pipeline: StanfordCoreNLP): Seq[String] = {
    val doc = new Annotation(text)
    pipeline.annotate(doc)
    val lemmas = new ArrayBuffer[String]()
    val sentences = doc.get(classOf[SentencesAnnotation])
    for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
      val lemma = token.get(classOf[LemmaAnnotation])
      if (lemma.length > 2 && !stopWords.contains(lemma)) {
        lemmas += lemma.toLowerCase
      }
    }
    lemmas
  }

  val lemmatized = plainText.mapPartitions(p => {
    val props = new Properties()
    props.put("annotators", "tokenize, ssplit, pos, lemma")
    val pipeline = new StanfordCoreNLP(props)
    p.map(q => plainTextToLemmas(q, stopWords, pipeline))
  })
  lemmatized.foreach(println)

score 2 · Accepted Answer

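I think @user52045 has the right idea. The only modification I would make is to use mapPartitions instead of map, which lets you create the potentially expensive pipeline only once per partition. This may not be a huge hit for a lemmatization pipeline, but it becomes very important if you want to do something that requires a model, such as the NER part of the pipeline.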

def plainTextToLemmas(text: String, stopWords: Set[String], pipeline: StanfordCoreNLP): Seq[String] = {
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  val lemmas = new ArrayBuffer[String]()
  val sentences = doc.get(classOf[SentencesAnnotation])
  for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
    val lemma = token.get(classOf[LemmaAnnotation])
    if (lemma.length > 2 && !stopWords.contains(lemma)) {
      lemmas += lemma.toLowerCase
    }
  }
  lemmas
}

val lemmatized = plainText.mapPartitions(strings => {
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)
  strings.map(string => plainTextToLemmas(string, stopWords, pipeline))
})
lemmatized.foreach(println)
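To make the difference concrete without needing Spark or CoreNLP, here is a small sketch that simulates partitions with plain Scala collections and counts how often an expensive object is constructed. ExpensivePipeline, perRecord, and perPartition are hypothetical names invented for this illustration; the class merely stands in for something like StanfordCoreNLP.

```scala
object PerPartitionSetupDemo {
  // Counts how many times the (pretend) expensive resource was built.
  var constructed = 0

  // Hypothetical stand-in for a costly object such as an NLP pipeline.
  class ExpensivePipeline {
    constructed += 1
    def process(s: String): String = s.toLowerCase
  }

  // map-style: a new pipeline is created for every record.
  def perRecord(partitions: Seq[Seq[String]]): Seq[String] =
    partitions.flatMap(_.map(line => new ExpensivePipeline().process(line)))

  // mapPartitions-style: one pipeline per partition, reused for all its records.
  def perPartition(partitions: Seq[Seq[String]]): Seq[String] =
    partitions.flatMap { part =>
      val pipeline = new ExpensivePipeline()
      part.iterator.map(pipeline.process).toList
    }

  def main(args: Array[String]): Unit = {
    val data = Seq(Seq("A", "B", "C"), Seq("D", "E"))

    constructed = 0
    perRecord(data)
    println(s"per record: $constructed constructions")    // 5, one per record

    constructed = 0
    perPartition(data)
    println(s"per partition: $constructed constructions") // 2, one per partition
  }
}
```

Both functions return the same results; only the number of constructions differs, and that is the saving mapPartitions is meant to provide.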

score 0 · Accepted Answer

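I suggest using the Stanford CoreNLP wrapper for Apache Spark, since it provides an official API for basic CoreNLP functionality such as lemmatization and tokenization.

I have used it for lemmatization on Spark DataFrames.

Link: https://github.com/databricks/spark-corenlp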
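A minimal sketch of what this might look like, assuming the wrapper exposes a lemma column function in com.databricks.spark.corenlp.functions as described in the project's README (please verify against the version you use). It also assumes the spark-corenlp package and the CoreNLP models jar are on the classpath. The object name WrapperLemmaSketch, the helper withLemmas, and the column names text and lemmas are made up for this example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
// Assumption: the lemma function lives here, per the spark-corenlp README.
import com.databricks.spark.corenlp.functions.lemma

object WrapperLemmaSketch {
  // Adds a "lemmas" column computed from the "text" column.
  def withLemmas(df: DataFrame): DataFrame =
    df.withColumn("lemmas", lemma(col("text")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lemma-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // One of the sample lines from the question.
    val input = Seq(
      "cables cables finally able hear gem long rumored music ."
    ).toDF("text")

    withLemmas(input).show(truncate = false)
    spark.stop()
  }
}
```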
