scala - How do I generate n-grams in Scala?

Question

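I am trying to write an n-gram-based Dissociated Press algorithm in Scala. How do I generate n-grams for a large file? For example, take a file containing "the bee is the bee of the bees".

First, it has to pick a random n-gram, for example "the bee".
Then it has to look for an n-gram that starts with the last (n-1) words, for example "bee of".
It prints the last word of that n-gram, then repeats.

Can you give me some hints? Sorry for the inconvenience.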

score 13 · Accepted Answer

Your question could be a bit more specific, but here is my attempt.

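val words = "the bee is the bee of the bees"
// sliding(2) yields every pair of adjacent words (the bigrams);
// mkString(" ") keeps a space between the words when printing
words.split(' ').sliding(2).foreach(p => println(p.mkString(" ")))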

score 5 · Accepted Answer

You can try this with a parameter n:

val words = "the bee is the bee of the bees"
val w = words.split(" ")

val n = 4
// For every size i from 1 to n, collect all i-grams as lists
val ngrams = (1 to n).flatMap(i => w.sliding(i).map(_.toList))
ngrams foreach println

Output:

List(the)
List(bee)
List(is)
List(the)
List(bee)
List(of)
List(the)
List(bees)
List(the, bee)
List(bee, is)
List(is, the)
List(the, bee)
List(bee, of)
List(of, the)
List(the, bees)
List(the, bee, is)
List(bee, is, the)
List(is, the, bee)
List(the, bee, of)
List(bee, of, the)
List(of, the, bees)
List(the, bee, is, the)
List(bee, is, the, bee)
List(is, the, bee, of)
List(the, bee, of, the)
List(bee, of, the, bees)
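
To get from n-grams like these to the text generation described in the question, one option is to index each n-gram by its first (n-1) words and then walk that index at random. The following is only a sketch under assumptions: the names `DissociatedPress`, `generate`, `followers` and `maxWords` are made up for illustration, and stopping at a dead end (no n-gram starts with the current words) is one possible choice, not something the question specifies.

```scala
import scala.util.Random

object DissociatedPress {
  // All complete n-grams of the text, in order (a short trailing window is dropped).
  def ngrams(words: Vector[String], n: Int): Vector[Vector[String]] =
    words.sliding(n).filter(_.size == n).toVector

  // Returns at most maxWords words, but never fewer than the starting n-gram
  // (or nothing at all if the text has fewer than n words).
  def generate(text: String, n: Int, maxWords: Int, rnd: Random = new Random): Vector[String] = {
    require(n >= 1, "n must be at least 1")
    val words = text.split("\\s+").filter(_.nonEmpty).toVector
    val grams = ngrams(words, n)

    // Index: the first (n-1) words of an n-gram -> every word seen right after them.
    val followers: Map[Vector[String], Vector[String]] =
      grams.groupBy(_.init).map { case (prefix, gs) => prefix -> gs.map(_.last) }

    @annotation.tailrec
    def loop(out: Vector[String]): Vector[String] =
      if (out.size >= maxWords) out
      else followers.get(out.takeRight(n - 1)) match {
        // Pick one of the possible next words at random and continue.
        case Some(next) => loop(out :+ next(rnd.nextInt(next.size)))
        // Dead end: no n-gram starts with the last (n-1) words (assumed behaviour: stop).
        case None       => out
      }

    if (grams.isEmpty) Vector.empty
    else loop(grams(rnd.nextInt(grams.size))) // start from a random n-gram
  }
}
```

Calling `DissociatedPress.generate("the bee is the bee of the bees", 2, 10).mkString(" ")` gives a different word sequence on each run unless a seeded `Random` is passed in.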

score 4 · Accepted Answer

Here is a stream-based approach. It does not require much memory while computing the n-grams.

// Note: on Scala 2.13+, Stream is deprecated in favor of LazyList.
object ngramstream extends App {

  // Apply f to each element of the stream, one at a time
  def process(st: Stream[Array[String]])(f: Array[String] => Unit): Stream[Array[String]] = st match {
    case x #:: xs => {
      f(x)
      process(xs)(f)
    }
    case _ => Stream[Array[String]]()
  }

  // Lazily concatenate the streams of 2-grams, 3-grams, ..., n-grams
  def ngrams(n: Int, words: Array[String]) = {
    // exclude 1-grams
    (2 to n).map { i => words.sliding(i).toStream }
      .foldLeft(Stream[Array[String]]()) {
        (a, b) => a #::: b
      }
  }

  val words = "the bee is the bee of the bees"
  val n = 4
  val ngrams2 = ngrams(n, words.split(" "))

  process(ngrams2) { x =>
    println(x.toList)
  }

}

Output:

List(the, bee)
List(bee, is)
List(is, the)
List(the, bee)
List(bee, of)
List(of, the)
List(the, bees)
List(the, bee, is)
List(bee, is, the)
List(is, the, bee)
List(the, bee, of)
List(bee, of, the)
List(of, the, bees)
List(the, bee, is, the)
List(bee, is, the, bee)
List(is, the, bee, of)
List(the, bee, of, the)
List(bee, of, the, bees)
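
For input that really is too large to hold in memory, the same sliding idea can be applied directly to an iterator over the file's lines, so that only the current window of n words is kept at a time. This is a sketch under assumptions rather than part of the answer above: `LazyNgrams`, `ngrams` and `foreachInFile` are hypothetical names, words are assumed to be separated by whitespace, and n-grams are allowed to span line breaks.

```scala
import scala.io.Source

object LazyNgrams {
  // Turn an iterator of lines into an iterator of n-grams.
  // Nothing is read until the result is consumed.
  def ngrams(lines: Iterator[String], n: Int): Iterator[List[String]] =
    lines
      .flatMap(_.split("\\s+").iterator) // words, assuming whitespace separators
      .filter(_.nonEmpty)                // skip empty tokens from leading whitespace
      .sliding(n)
      .withPartial(false)                // no window shorter than n
      .map(_.toList)

  // Hypothetical helper: run f on every n-gram of a file, then close the file.
  def foreachInFile(path: String, n: Int)(f: List[String] => Unit): Unit = {
    val src = Source.fromFile(path)
    try ngrams(src.getLines(), n).foreach(f)
    finally src.close()
  }
}
```

For example, `LazyNgrams.ngrams(Iterator("the bee is", "the bee of the bees"), 2).foreach(println)` walks the bigrams of the sample sentence without building the whole list first.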
