quanteda: how to get ngrams and their frequencies given the n-1 preceding words/types

Question

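For next-word prediction using ngrams, I need to find all ngrams (and their frequencies) given the n-1 preceding words.
I don't see any way to do this in the dfm, so I started implementing it manually on the textstat_frequency result (a data.frame). Along the way I came across some methods whose documentation (on this page) is not clear to me, hence this question.
(I have implicitly, and maybe wrongly, excluded regular expressions, which I normally like, out of a prejudice that running them on hundreds of thousands of strings may be too slow/heavy.)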
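To make the manual approach concrete, here is a minimal base R sketch of the lookup described above. It is not quanteda's API: the `freq` data.frame is a hand-built stand-in for a textstat_frequency result (assumed here to have `feature` and `frequency` columns), and `make_ngrams` and `ngrams_after` are hypothetical helper names. It uses `startsWith()`, a fixed-string comparison, so no regular expression is involved.

```r
toks <- strsplit("a b 1 2 3 a b 2 3 4 a b 3 4 5", " ", fixed = TRUE)[[1]]

# Hypothetical helper: all n-grams of size n, joined with "_".
make_ngrams <- function(toks, n) {
  if (length(toks) < n) return(character(0))
  starts <- seq_len(length(toks) - n + 1)
  vapply(starts,
         function(i) paste(toks[i:(i + n - 1)], collapse = "_"),
         character(1))
}

# Hand-built stand-in for a frequency table: one row per distinct trigram.
counts <- table(make_ngrams(toks, 3))
freq <- data.frame(feature = names(counts),
                   frequency = as.integer(counts),
                   stringsAsFactors = FALSE)

# Hypothetical helper: keep the n-grams that start with the given (n-1)-gram,
# most frequent first.
ngrams_after <- function(freq, prefix) {
  hits <- freq[startsWith(freq$feature, paste0(prefix, "_")), , drop = FALSE]
  hits[order(-hits$frequency), , drop = FALSE]
}

res <- ngrams_after(freq, "a_b")
print(res)  # a_b_1, a_b_2, a_b_3, each with frequency 1
```

The table must hold ngrams of a single size n, otherwise the prefix "a_b_" would also match longer ngrams such as a_b_1_2. Whether this is fast enough on hundreds of thousands of strings would need to be measured.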

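Following the suggestion in the comments I looked at fcm(), but I can only get the ngrams that follow an ngram, as shown in the code below. That is not what I asked for, since it only works for n = 2 (and the resulting matrix still needs to be subset to the given (n-1)-gram).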

txt <- c("a b 1 2 3 a b 2 3 4 a b 3 4 5")
fcm(tokens(txt, ngram = 2), "window", window = 1, ordered = T)
Feature co-occurrence matrix of: 10 by 10 features.
10 x 10 sparse Matrix of class "fcm"
        features
features a_b b_1 1_2 2_3 3_a b_2 3_4 4_a b_3 4_5
     a_b   0   1   0   0   0   1   0   0   1   0
     b_1   0   0   1   0   0   0   0   0   0   0
     1_2   0   0   0   1   0   0   0   0   0   0
     2_3   0   0   0   0   1   0   1   0   0   0
     3_a   1   0   0   0   0   0   0   0   0   0
     b_2   0   0   0   1   0   0   0   0   0   0
     3_4   0   0   0   0   0   0   0   1   0   1
     4_a   1   0   0   0   0   0   0   0   0   0
     b_3   0   0   0   0   0   0   1   0   0   0
     4_5   0   0   0   0   0   0   0   0   0   0

The code above used quanteda installed from GitHub on 20 August 2018, which should include the fix that resulted from this issue.

packageVersion("quanteda")
[1] ‘1.3.5’

score 0 · Accepted Answer

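A package contributor kindly provided sample code (here) that shows how to achieve what I asked for, as long as the text is not too large. I have copied that code here, with some simplifications and comments, to make it as easy to understand as possible.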

sample_code <- function() {

  require(quanteda)

  print(paste("based on","https://github.com/quanteda/quanteda/issues/1413#issuecomment-414795832"))
  print("great package great support, thanks")

  ngms <- tokens("a b 1 2 3 a b 2 3 4 a b 3 4 5", n = 2:5)

  # drop the tokens metadata, which is not needed for this use case
  ngms_lst <-  as.list(ngms)
  ngms_unlst  <- unlist(ngms_lst) # (named) character vector of "_"-separated ngrams

  # replace the last "_" with " " so each ngram becomes the pair
  # "(n-1)-gram" "nth token"
  ngms_blank_sep <- stringi::stri_replace_last_fixed(ngms_unlst,"_", " ")

  # tokens object whose elements are character(2): ( (n-1)-gram, nth token )
  tk2_lst <- tokens(ngms_blank_sep)

  # --- end of tokens/ngrams pre-processing

  # ordinary fcm
  fcm_ord <- fcm(tk2_lst , ordered = TRUE)

  fcm_ord[33:39, 1:6]
}


sample_code()
[1] "based on https://github.com/quanteda/quanteda/issues/1413#issuecomment-414795832"
[1] "great package great support, thanks"
Feature co-occurrence matrix of: 7 by 6 features.
7 x 6 sparse Matrix of class "fcm"
         features
features  a b 1 2 3 4
  3_a_b_2 0 0 0 0 1 0
  a_b_2_3 0 0 0 0 0 1
  b_2_3_4 1 0 0 0 0 0
  2_3_4_a 0 1 0 0 0 0
  3_4_a_b 0 0 0 0 1 0
  4_a_b_3 0 0 0 0 0 1
  a_b_3_4 0 0 0 0 0 0
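
The core idea of the code above (split each ngram at its last "_" and count which token follows which (n-1)-gram) can also be sketched in base R, without quanteda. This is only an illustration under assumptions: `followers` is a hypothetical helper name, the ngrams are built by hand rather than with tokens(), and tokens are assumed not to contain "_" themselves.

```r
toks <- strsplit("a b 1 2 3 a b 2 3 4 a b 3 4 5", " ", fixed = TRUE)[[1]]

# All n-grams for n = 2..5, joined with "_".
ngrams <- unlist(lapply(2:5, function(n) {
  vapply(seq_len(length(toks) - n + 1),
         function(i) paste(toks[i:(i + n - 1)], collapse = "_"),
         character(1))
}))

# Split each n-gram at its last "_": the part before it is the (n-1)-gram,
# the part after it is the nth token.
prefix <- sub("_[^_]*$", "", ngrams)
nxt    <- sub("^.*_", "", ngrams)

# Cross-tabulate: rows are (n-1)-grams, columns are candidate next tokens.
# unclass() turns the table into a plain integer matrix with dimnames.
counts <- unclass(table(prefix, nxt))

# Hypothetical helper: next tokens seen after prefix p, most frequent first.
followers <- function(counts, p) {
  if (!p %in% rownames(counts)) return(integer(0))
  row <- counts[p, ]
  sort(row[row > 0], decreasing = TRUE)
}

print(followers(counts, "a_b"))  # tokens 1, 2 and 3, each seen once
print(followers(counts, "3"))    # token 4 seen twice, token a once
```

Unlike subsetting the fcm by row position (`fcm_ord[33:39, 1:6]`), this looks a prefix up by name, which is closer to what next-word prediction needs. The dense matrix built here would not scale to a large vocabulary, where a sparse structure like the fcm is the better fit.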
