r - R: removeCommonTerms with the Quanteda package?

Question

A removeCommonTerms function for the tm package is available here, defined as follows:

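removeCommonTerms <- function (x, pct) 
{
    # x must be a tm matrix; pct is a fraction of documents, strictly between 0 and 1
    stopifnot(inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")), 
        is.numeric(pct), pct > 0, pct < 1)
    # work on a term-by-document orientation so that m$i indexes terms
    m <- if (inherits(x, "DocumentTermMatrix")) 
        t(x)
    else x
    # TRUE for terms that occur in fewer than pct * (number of documents) documents
    t <- table(m$i) < m$ncol * (pct)
    termIndex <- as.numeric(names(t[t]))
    # keep only those terms, returning the same orientation as the input
    if (inherits(x, "DocumentTermMatrix")) 
        x[, termIndex]
    else x[termIndex, ]
}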

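Now I would like to remove overly common terms with the Quanteda package. I could do this removal either before creating a document-feature matrix or on an existing document-feature matrix.

How can I remove overly common terms with the Quanteda package in R?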

score 2 · Accepted Answer

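You want the dfm_trim function. From ?dfm_trim:

max_docfreq: the maximum number or fraction of documents in which a feature appears, above which features will be removed. (Default is no upper limit.)

This requires the newest version of quanteda (fresh on CRAN).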

packageVersion("quanteda")
## [1] ‘0.9.9.3’

inaugdfm <- dfm(data_corpus_inaugural)

dfm_trim(inaugdfm, max_docfreq = .8)
## Removing features occurring: 
##   - in more than 0.8 * 57 = 45.6 documents: 93
##   Total features removed: 93 (1.01%).
## Document-feature matrix of: 57 documents, 9,081 features (92.4% sparse).
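
The call above reflects the 0.9.9.x interface. As a hedged sketch for later quanteda releases (assuming version 3 or newer, where a dfm is built from a tokens object and a fractional threshold is declared through the docfreq_type argument), the same trimming might look like the following. The object names toks and trimmed are illustrative only.

```r
library(quanteda)

# Assumption: quanteda >= 3, where dfm() is applied to a tokens object
toks <- tokens(data_corpus_inaugural)
inaugdfm <- dfm(toks)

# Treat max_docfreq as a proportion of documents rather than a raw count
trimmed <- dfm_trim(inaugdfm, max_docfreq = 0.8, docfreq_type = "prop")

# Number of features before and after trimming
nfeat(inaugdfm)
nfeat(trimmed)

# Every remaining feature should appear in at most 80% of the documents
all(docfreq(trimmed) <= 0.8 * ndoc(trimmed))
```

With docfreq_type = "prop" the threshold reads as "drop any feature found in more than 80% of documents", which matches the intent of the pct argument in removeCommonTerms.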
