这是计算特征卡方值的通用方法。它要求您有一些变量来形成关联,这可能是您用于训练分类器的一些分类变量。
请注意,我在quanteda包中展示了如何执行此操作,但结果应该足够通用,可以适用于其他文本包矩阵对象。在这里,我使用来自辅助quantedaData包的数据,该包包含美国总统的所有国情咨文地址。
data(data_corpus_sotu, package = "quanteda.corpora")
table(docvars(data_corpus_sotu, "party"))
## Democratic Democratic-Republican Federalist Independent
## 90 28 4 8
## Republican Whig
## 9 8
sotuDemRep <- corpus_subset(data_corpus_sotu, party %in% c("Democratic", "Republican"))
# make the document-feature matrix for just Reps and Dems
sotuDfm <- dfm(sotuDemRep, remove = stopwords("english"))
# compute chi-squared values for each feature
chi2vals <- apply(sotuDfm, 2, function(x) {
chisq.test(as.numeric(x), docvars(sotuDemRep, "party"))$statistic
})
head(sort(chi2vals, decreasing = TRUE), 10)
## government will united states year public congress upon
## 85.19783 74.55845 68.62642 66.57434 64.30859 63.19322 59.49949 57.83603
## war people
## 57.43142 57.38697
现在可以使用dfm_select()
命令选择这些。(请注意,按名称进行列索引也可以。)
# select just 100 top Chi^2 vals from dfm
dfmTop100cs <- dfm_select(sotuDfm, names(head(sort(chi2vals, decreasing = TRUE), 100)))
## kept 100 features, from 100 supplied (glob) feature types
head(dfmTop100cs)
## Document-feature matrix of: 182 documents, 100 features.
## (showing first 6 documents and first 6 features)
## features
## docs citizens government upon duties constitution present
## Jackson-1830 14 68 67 12 17 23
## Jackson-1831 21 26 13 7 5 22
## Jackson-1832 17 36 23 11 11 18
## Jackson-1829 17 58 37 16 7 17
## Jackson-1833 14 43 27 18 1 17
## Jackson-1834 24 74 67 11 11 29
补充:>= v0.9.9 可以使用该textstat_keyness()
函数来完成。
# to avoid empty factors
docvars(data_corpus_sotu, "party") <- as.character(docvars(data_corpus_sotu, "party"))
# make the document-feature matrix for just Reps and Dems
sotuDfm <- data_corpus_sotu %>%
corpus_subset(party %in% c("Democratic", "Republican")) %>%
dfm(remove = stopwords("english"))
chi2vals <- dfm_group(sotuDfm, "party") %>%
textstat_keyness(measure = "chi2")
head(chi2vals)
# feature chi2 p n_target n_reference
# 1 - 221.6249 0 2418 1645
# 2 mexico 181.0586 0 505 182
# 3 bank 164.9412 0 283 60
# 4 " 148.6333 0 1265 800
# 5 million 132.3267 0 366 131
# 6 texas 101.1991 0 174 37
在移除 chi^2 分数的符号后,此信息可用于选择最具辨别力的特征。
# remove sign
chi2vals$chi2 <- abs(chi2vals$chi2)
# sort
chi2vals <- chi2vals[order(chi2vals$chi2, decreasing = TRUE), ]
head(chi2vals)
# feature chi2 p n_target n_reference
# 1 - 221.6249 0 2418 1645
# 29044 commission 190.3010 0 175 588
# 2 mexico 181.0586 0 505 182
# 3 bank 164.9412 0 283 60
# 4 " 148.6333 0 1265 800
# 29043 law 137.8330 0 607 1178
dfmTop100cs <- dfm_select(sotuDfm, chi2vals$feature)
## kept 100 features, from 100 supplied (glob) feature types
head(dfmTop100cs, nf = 6)
Document-feature matrix of: 6 documents, 6 features (0% sparse).
6 x 6 sparse Matrix of class "dfm"
features
docs fellow citizens senate house representatives :
Jackson-1829 5 17 2 3 5 1
Jackson-1830 6 14 4 6 9 3
Jackson-1831 9 21 3 1 4 1
Jackson-1832 6 17 4 1 2 1
Jackson-1833 2 14 7 4 6 1
Jackson-1834 3 24 5 1 3 5