r - Quanteda phrasetotoken does not work

Question

Situation 1

When applying the phrasetotoken function from the quanteda package, I get strange results:

dict    <- dictionary(list(words = ......*lokale energie productie*......)) 
txt     <- c("I like lokale energie producties")
phrasetotoken(txt, dict)

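Problem: sometimes I get lokale_energie_producties back, and sometimes I incorrectly get the original lokale energie producties.

The problem seems to be related to the dots in the dictionary. These dots are (?) needed to handle leading and trailing characters (e.g. "1lokale energie productieniveau").

Situation 2

When the text is loaded from a txt file, the phrasetotoken function does not work at all.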

txt <- paste(readLines("foo.txt"), collapse=" ")
txt <- phrasetotoken(txt, dict)

Note: using the function readtext instead of readLines raises the following error:

Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘phrasetotoken’ for signature ‘"readtext", "dictionary"’

Any help is appreciated.

score 0 · Accepted Answer

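Situation 1

We have replaced phrasetotoken() with the more powerful and flexible function tokens_compound(). This is how it works (after some modifications to your code to make it syntactically correct):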

library(quanteda)  # provides tokens() and tokens_compound()
txt <- c("I like lokale energie producties")
toks <- tokens(txt)

tokens_compound(toks, list(words = c("*lokale", "energie",  "productie*")))
## tokens from 1 document.
## Component 1 :
## [1] "I"                         "like"                      "lokale_energie_producties"
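
To make the glob matching above more concrete, here is a rough base-R sketch of the idea: each pattern in the sequence is matched against one token, and a run of tokens that matches the whole sequence is joined with an underscore. This is only an illustration and not how quanteda implements it; compound_glob is a hypothetical helper name, and case-insensitive matching is an assumption made for the example.

```r
# Hypothetical helper (not part of quanteda): join every run of tokens that
# matches a sequence of glob patterns, one pattern per token.
compound_glob <- function(toks, globs, concatenator = "_") {
  rx <- utils::glob2rx(globs)   # e.g. "*lokale" becomes "^.*lokale$"
  n <- length(rx)
  out <- character(0)
  i <- 1
  while (i <= length(toks)) {
    last <- i + n - 1
    # The window matches only if every token matches its own pattern.
    window_ok <- last <= length(toks) &&
      all(mapply(grepl, rx, toks[i:last], ignore.case = TRUE))
    if (window_ok) {
      out <- c(out, paste(toks[i:last], collapse = concatenator))
      i <- last + 1   # skip past the tokens that were just joined
    } else {
      out <- c(out, toks[i])
      i <- i + 1
    }
  }
  out
}

toks <- c("I", "like", "lokale", "energie", "producties")
result <- compound_glob(toks, c("*lokale", "energie", "productie*"))
print(result)
```

With glob patterns, the leading and trailing * play the role that the dots were meant to play in the question: "*lokale" also accepts a token such as "1lokale", and "productie*" also accepts "productieniveau".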

Situation 2

Please try the following workflow:

require(magrittr)  # for the pipes
readtext("foo.txt") %>%
    corpus() %>%
    tokens() %>%
    tokens_compound(sequences = dict)
