r - Creating N-Grams with tm and RWeka - works with VCorpus but not Corpus

Question

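After following many guides on creating bigrams with the 'tm' and 'RWeka' packages, I was frustrated to find that only 1-grams were returned in the TDM. After much trial and error, I found that the correct behavior is achieved with 'VCorpus' but not with 'Corpus'. Incidentally, I'm fairly sure this worked with 'Corpus' about a month ago, but it no longer does.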

R (3.3.3), RTools (3.4), RStudio (1.0.136), and all packages (tm 0.7-1, RWeka 0.4-31) have been updated to the latest versions.

I would appreciate hearing whether others have the same problem.

#A Reproducible example
#
#Weka bi-gram test
#

library(tm)
library(RWeka)

someCleanText <- c("Congress shall make no law respecting an establishment of",
                    "religion, or prohibiting the free exercise thereof or",
                    "abridging the freedom of speech or of the press or the",
                    "right of the people peaceably to assemble and to petition",
                    "the Government for a redress of grievances")

aCorpus <- Corpus(VectorSource(someCleanText))   #With this, only 1-Grams are created
#aCorpus <- VCorpus(VectorSource(someCleanText)) #With this, biGrams are created as desired

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))

aTDM <- TermDocumentMatrix(aCorpus, control=list(tokenize=BigramTokenizer))

print(aTDM$dimnames$Terms)

Results with 'Corpus'

 [1] "congress"      "establishment" "law"           "make"         
 [5] "respecting"    "shall"         "exercise"      "free"         
 [9] "prohibiting"   "religion"      "the"           "thereof"      
[13] "abridging"     "freedom"       "press"         "speech"       
[17] "and"           "assemble"      "peaceably"     "people"       
[21] "petition"      "right"         "for"           "government"   
[25] "grievances"    "redress"

Results with 'VCorpus'

 [1] "a redress"        "abridging the"    "an establishment" "and to"          
 [5] "assemble and"     "congress shall"   "establishment of" "exercise thereof"
 [9] "for a"            "free exercise"    "freedom of"       "government for"  
[13] "law respecting"   "make no"          "no law"           "of grievances"   
[17] "of speech"        "of the"           "or of"            "or prohibiting"  
[21] "or the"           "peaceably to"     "people peaceably" "press or"        
[25] "prohibiting the"  "redress of"       "religion or"      "respecting an"   
[29] "right of"         "shall make"       "speech or"        "the free"        
[33] "the freedom"      "the government"   "the people"       "the press"       
[37] "thereof or"       "to assemble"      "to petition"

score 0 · Accepted Answer

I was using R 3.4.1 and switched to R 3.3.3, and now the VCorpus solution works for me. tm and RWeka both create the bigrams correctly.

sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

score 0 · Accepted Answer

I was able to reproduce exactly the same results you got.

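When I started reading about Corpus and VCorpus, most references stated that the difference is basically that VCorpus is a volatile corpus kept in memory, but that is not the only difference. Corpus uses SimpleCorpus by default, which does not have all the properties that VCorpus has, and that is why you get 2-grams with VCorpus but not with a regular Corpus. For more on this, see this post on Stack Exchange: https://stats.stackexchange.com/questions/164372/what-is-vectorsource-and-vcorpus-in-tm-text-mining-package-in-r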
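To see the difference for yourself, a minimal sketch like the following prints the class of each corpus. It assumes tm 0.7 or later, where Corpus() on a VectorSource is expected to return a SimpleCorpus. The helper name is_simple_corpus is hypothetical and not part of tm.

```r
library(tm)

txt <- c("Congress shall make no law respecting an establishment of",
         "religion, or prohibiting the free exercise thereof or")

# Assumption: with tm >= 0.7 this is a SimpleCorpus; older versions may differ.
simpleCorpus <- Corpus(VectorSource(txt))

# VCorpus() always builds a volatile (in-memory) corpus.
volatileCorpus <- VCorpus(VectorSource(txt))

print(class(simpleCorpus))
print(class(volatileCorpus))

# Hypothetical helper (not part of tm): TRUE when the corpus is a SimpleCorpus,
# i.e. the kind of corpus for which the question saw the custom tokenizer ignored.
is_simple_corpus <- function(corpus) inherits(corpus, "SimpleCorpus")

print(is_simple_corpus(simpleCorpus))
print(is_simple_corpus(volatileCorpus))
```

If the first class vector contains "SimpleCorpus", building the corpus with VCorpus() instead, as in the commented-out line of the question, is the simplest way to have the custom tokenizer respected.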
