r - How to perform lemmatization in R?

Question

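This question may be a duplicate of "Lemmatizer in R or python (am, are, is -> be?)", but I am asking it again because the previous one was closed as too broad, and its only answer is not efficient: it queries an external website, which is too slow for the very large corpus I need to find lemmas for. So part of this question will be similar to the one mentioned above.

According to Wikipedia, lemmatization is defined as:

Lemmatisation (or lemmatization) in linguistics is the process of grouping together the different inflected forms of a word so they can be analysed as a single item.

A simple Google search for lemmatization in R only points to the R package wordnet. I tried this package expecting that a character vector c("run", "ran", "running") passed to its lemmatization function would result in c("run", "run", "run"). Instead, I found that the package only provides functionality similar to grepl, through various filter names and a dictionary.

Here is example code from the wordnet package. It returns at most 5 words starting with "car", as the filter name suggests: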

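library(wordnet)

# Build a filter that matches index terms starting with "car" (TRUE = ignore case)
filter <- getTermFilter("StartsWithFilter", "car", TRUE)
# Fetch at most 5 matching noun entries, then extract the lemma of each
terms <- getIndexTerms("NOUN", 5, filter)
sapply(terms, getLemma)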

The above is not the lemmatization I am looking for. What I want is to use R to find the true root of each word (e.g. from c("run", "ran", "running") to c("run", "run", "run")).

score 31 · Accepted Answer

Hello, you can try the koRpus package, which lets you use TreeTagger:

library(koRpus)  # provides treetag(); TreeTagger itself must be installed separately

tagged.results <- treetag(c("run", "ran", "running"), treetagger="manual", format="obj",
                      TT.tknz=FALSE , lang="en",
                      TT.options=list(path="./TreeTagger", preset="en"))
tagged.results@TT.res

##     token tag lemma lttr wclass                               desc stop stem
## 1     run  NN   run    3   noun             Noun, singular or mass   NA   NA
## 2     ran VVD   run    3   verb                   Verb, past tense   NA   NA
## 3 running VVG   run    7   verb Verb, gerund or present participle   NA   NA

See the lemma column for the result you are asking for.

score 19 · Accepted Answer

As a previous post mentioned, the function lemmatize_words() from the R package textstem can do this and gives what I understand to be your desired result:

library(textstem)
vector <- c("run", "ran", "running")
lemmatize_words(vector)

## [1] "run" "run" "run"
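For intuition, dictionary-based lemmatization of a word vector boils down to a table lookup. The sketch below uses only base R and a tiny hand-made lookup table (`lemma_table` and `lookup_lemma()` are hypothetical names, and the table is nowhere near complete). It is not meant to show how textstem is implemented, only to illustrate the idea; words missing from the table are returned unchanged.

```r
# Hypothetical mini-dictionary: names are inflected forms, values are lemmas.
lemma_table <- c(ran = "run", running = "run", runs = "run",
                 am = "be", are = "be", is = "be")

# Look up each word; fall back to the word itself when it is not in the table.
lookup_lemma <- function(words, table = lemma_table) {
  hits <- unname(table[tolower(words)])
  ifelse(is.na(hits), words, hits)
}

lookup_lemma(c("run", "ran", "running"))
```

A real dictionary would hold many thousands of entries, which is why a ready-made one from a package is usually preferable to a hand-written table.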

score 9 · Accepted Answer

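@Andy and @Arunkumar are correct that the textstem library can be used to perform stemming and/or lemmatization. However, lemmatize_words() only works on a vector of words. In a corpus we do not have a vector of words; we have strings, each string being the content of one document. So, to lemmatize a corpus, you can pass the function lemmatize_strings() as an argument to tm_map() from the tm package.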

> corpus[[1]]
[1] " earnest roughshod document serves workable primer regions recent history make 
terrific th-grade learning tool samuel beckett applied iranian voting process bard 
black comedy willie loved another trumpet blast may new mexican cinema -bornin "
> corpus <- tm_map(corpus, lemmatize_strings)
> corpus[[1]]
[1] "earnest roughshod document serve workable primer region recent history make 
terrific th - grade learn tool samuel beckett apply iranian vote process bard black 
comedy willie love another trumpet blast may new mexican cinema - bornin"

Do not forget to run the following line of code after you have done the lemmatization:

> corpus <- tm_map(corpus, PlainTextDocument)

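This is because, to create a document-term matrix, you need objects of type 'PlainTextDocument', and that type is lost after you apply lemmatize_strings(). More specifically, the corpus object no longer holds the content and metadata of each document; it is now just a structure containing the documents' content, which is not the type of object that DocumentTermMatrix() takes as an argument.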
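To make the word-vector versus string distinction concrete, here is a minimal base R sketch of string-level lemmatization: split each document into tokens, look each token up, and paste the result back into one string per document. The names `lemma_table` and `lemmatize_docs()` are hypothetical, and the tiny dictionary is for illustration only; this is not textstem's implementation.

```r
# Hypothetical mini-dictionary (inflected form -> lemma), for illustration only.
lemma_table <- c(serves = "serve", regions = "region", learning = "learn",
                 applied = "apply", voting = "vote", loved = "love")

# Lemmatize whole strings: split on whitespace, look up each token, paste back.
lemmatize_docs <- function(docs, table = lemma_table) {
  vapply(docs, function(doc) {
    tokens <- strsplit(trimws(doc), "[[:space:]]+")[[1]]
    hits <- unname(table[tokens])
    paste(ifelse(is.na(hits), tokens, hits), collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

docs <- c("document serves workable primer regions",
          "samuel beckett applied iranian voting process")
lemmatize_docs(docs)
```

The function returns a plain character vector with one element per document, which mirrors the point above: the result is just document content, without any per-document metadata.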

Hope this helps!

score 4 · Accepted Answer

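Maybe stemming is enough for you? Typical natural language processing tasks make do with stemmed text. You can find several packages in the CRAN Task View for NLP: http://cran.r-project.org/web/views/NaturalLanguageProcessing.html

If you really do need something more sophisticated, there are specialized solutions based on mapping sentences to neural nets. As far as I know, these require massive amounts of training data. The Stanford NLP Group has created and made available a lot of open software.

If you really want to dig into the topic, you can go through the event archives linked from the publications section of the same Stanford NLP Group. There are also some books on the topic.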

score 3 · Accepted Answer

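I think the answers here are a bit outdated. You should now be using the R package udpipe - available at https://CRAN.R-project.org/package=udpipe - see https://github.com/bnosac/udpipe or the docs at https://bnosac.github.io/udpipe/en

In the following example, note the difference between the word meeting (NOUN) and the word meet (VERB) when doing lemmatization versus stemming, and the annoying mangling of the word "someone" into "someon" when doing stemming.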

library(udpipe)
x <- c(doc_a = "In our last meeting, someone said that we are meeting again tomorrow",
       doc_b = "It's better to be good at being the best")
anno <- udpipe(x, "english")
anno[, c("doc_id", "sentence_id", "token", "lemma", "upos")]
#>    doc_id sentence_id    token    lemma  upos
#> 1   doc_a           1       In       in   ADP
#> 2   doc_a           1      our       we  PRON
#> 3   doc_a           1     last     last   ADJ
#> 4   doc_a           1  meeting  meeting  NOUN
#> 5   doc_a           1        ,        , PUNCT
#> 6   doc_a           1  someone  someone  PRON
#> 7   doc_a           1     said      say  VERB
#> 8   doc_a           1     that     that SCONJ
#> 9   doc_a           1       we       we  PRON
#> 10  doc_a           1      are       be   AUX
#> 11  doc_a           1  meeting     meet  VERB
#> 12  doc_a           1    again    again   ADV
#> 13  doc_a           1 tomorrow tomorrow  NOUN
#> 14  doc_b           1       It       it  PRON
#> 15  doc_b           1       's       be   AUX
#> 16  doc_b           1   better   better   ADJ
#> 17  doc_b           1       to       to  PART
#> 18  doc_b           1       be       be   AUX
#> 19  doc_b           1     good     good   ADJ
#> 20  doc_b           1       at       at SCONJ
#> 21  doc_b           1    being       be   AUX
#> 22  doc_b           1      the      the   DET
#> 23  doc_b           1     best     best   ADJ
lemmatisation <- paste.data.frame(anno, term = "lemma", 
                                  group = c("doc_id", "sentence_id"))
lemmatisation
#>   doc_id sentence_id
#> 1  doc_a           1
#> 2  doc_b           1
#>                                                             lemma
#> 1 in we last meeting , someone say that we be meet again tomorrow
#> 2                          it be better to be good at be the best

library(SnowballC)
tokens   <- strsplit(x, split = "[[:space:][:punct:]]+")
stemming <- lapply(tokens, FUN = function(x) wordStem(x, language = "en"))
stemming
#> $doc_a
#>  [1] "In"       "our"      "last"     "meet"     "someon"   "said"    
#>  [7] "that"     "we"       "are"      "meet"     "again"    "tomorrow"
#> 
#> $doc_b
#>  [1] "It"     "s"      "better" "to"     "be"     "good"   "at"     "be"    
#>  [9] "the"    "best"
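To see the two outputs side by side, the sketch below copies a few rows from the results above into a small data frame and keeps the rows where the lemma and the stem disagree. The data frame `comparison` is a hand-made excerpt (a hypothetical name, built only from values printed above), not something either package returns.

```r
# Hand-copied excerpt of the udpipe lemmas and SnowballC stems shown above.
comparison <- data.frame(
  token = c("meeting", "said", "are", "meeting", "someone"),
  upos  = c("NOUN",    "VERB", "AUX", "VERB",    "PRON"),
  lemma = c("meeting", "say",  "be",  "meet",    "someone"),
  stem  = c("meet",    "said", "are", "meet",    "someon"),
  stringsAsFactors = FALSE
)

# Rows where lemmatization and stemming give different results.
differs <- comparison[comparison$lemma != comparison$stem, ]
differs
```

Only the verb "meeting" comes out the same under both approaches; the lemmatizer uses the part-of-speech tag to keep the noun "meeting" intact, while the stemmer treats both occurrences identically.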

score 0 · Accepted Answer

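Lemmatization can be done easily in R with the textstem package.
The steps are:
1) Install textstem
2) Load the package with library(textstem)
3) Run the following line, where stem_word is the result of lemmatization and word is the input word:
stem_word=lemmatize_words(word, dictionary = lexicon::hash_lemmas)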
