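To save memory when working with very large corpus samples, I want to keep only the top 10 1-grams and combine them with all of the 2- through 5-grams into a single quanteda::dfmSparse object, which will be used for natural language processing [nlp] predictions. Carrying all of the 1-grams would be pointless, because only the top ten [or twenty] will ever be used by the simple back-off model I am working with.
I could not find a quanteda::dfm(corpusText, . . .) parameter that tells it to return only the top ## features. So, based on comments by package author @KenB in other threads, I am using the dfm_select/remove functions to extract the top ten 1-grams, and, based on the "quanteda dfm join" search hit "concatenate dfm matrices in 'quanteda' package", I am using the rbind.dfmSparse??? function to join the results.
So far everything looks right, as far as I can tell. I thought I would bounce this game plan off the SO community to see whether I am overlooking a more efficient route to this result, or whether there is some flaw in the solution I have arrived at so far.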
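library(quanteda)    # dfm(), dfm_sort(), dfm_select(), dfm_remove(), featnames()
library(data.table)  # data.table() is used for the final frequency listing
# small stand-in corpus; in practice the text is going to be very large
corpusObject <- quanteda::corpus(paste("some corpus text of no consequence that in practice is going to be very large\n",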
"and so one might expect a very large number of ngrams but for nlp purposes only care about top ten\n",
"adding some corpus text word repeats to ensure 1gram top ten selection approaches are working\n"))
corpusObject$documents
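# 1-grams, sorted by feature frequency (most frequent first)
dfm1gramsSorted <- dfm_sort(dfm(corpusObject, tolower = TRUE, stem = FALSE, ngrams = 1))
# all 2- through 5-grams
dfm2to5grams <- quanteda::dfm(corpusObject, tolower = TRUE, stem = FALSE, ngrams = 2:5)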
dfm1gramsSorted; dfm2to5grams
#featnames(dfm1gramsSorted); featnames(dfm2to5grams)
#colSums(dfm1gramsSorted); colSums(dfm2to5grams)
dfm1gramsSortedLen <- length(featnames(dfm1gramsSorted))
# option1 - select top 10 features from dfm1gramsSorted
dfmTopTen1grams <- dfm_select(dfm1gramsSorted, pattern = featnames(dfm1gramsSorted)[1:10])
dfmTopTen1grams; featnames(dfmTopTen1grams)
# option2 - drop all but top 10 features from dfm1gramsSorted
dfmTopTen1grams <- dfm_remove(dfm1gramsSorted, pattern = featnames(dfm1gramsSorted)[11:dfm1gramsSortedLen])
dfmTopTen1grams; featnames(dfmTopTen1grams)
# join the top ten 1-grams with all 2- to 5-grams (this is the step I am unsure about)
dfmTopTen1gramsAndAll2to5grams <- rbind(dfmTopTen1grams, dfm2to5grams)
dfmTopTen1gramsAndAll2to5grams
#featnames(dfmTopTen1gramsAndAll2to5grams); colSums(dfmTopTen1gramsAndAll2to5grams)
data.table(ngram = featnames(dfmTopTen1gramsAndAll2to5grams)[1:50],
           frequency = colSums(dfmTopTen1gramsAndAll2to5grams)[1:50],
           keep.rownames = FALSE, stringsAsFactors = FALSE)
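
For comparison, here is a minimal sketch of the same plan written against a newer quanteda API. It rests on several assumptions that are not part of the code above: quanteda 3.x or later (where n-grams are built with tokens_ngrams() instead of the ngrams argument of dfm()), a shortened sample text, and illustrative variable names. It also swaps in cbind() for rbind(), on the understanding that cbind() joins dfm objects by feature (column) while rbind() stacks them by document (row); treat that as one possible alternative rather than a confirmed fix.

```r
library(quanteda)

# Hypothetical one-document sample; the real corpus would be far larger.
txt <- c(doc1 = paste("some corpus text of no consequence that in practice",
                      "is going to be very large and so one might expect",
                      "a very large number of ngrams"))

# Tokenise once and lower-case, so 1-grams and n-grams share the same tokens.
toks <- tokens_tolower(tokens(txt, remove_punct = TRUE))

dfm1grams    <- dfm(toks)                           # all 1-grams
dfm2to5grams <- dfm(tokens_ngrams(toks, n = 2:5))   # all 2- to 5-grams

# Names of the ten most frequent 1-grams.
topTenNames <- names(topfeatures(dfm1grams, n = 10))

# Keep only those ten features; "fixed" matches the names literally.
dfmTopTen1grams <- dfm_select(dfm1grams, pattern = topTenNames,
                              selection = "keep", valuetype = "fixed")

# Join by feature: same document row, columns from both objects side by side.
dfmCombined <- cbind(dfmTopTen1grams, dfm2to5grams)

ndoc(dfmCombined)    # number of documents in the joined object
nfeat(dfmCombined)   # ten 1-grams plus every 2- to 5-gram
```

With this layout the joined object keeps one row per document, so per-feature totals can still be read off with colSums() as in the listing above.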