这是一个更好的方法的建议,改用quanteda包。该方法:
- 创建一个命名的权重向量,对应于您的“字典”。
- 创建一个文档特征矩阵,只选择字典中的术语。
- 加权观察到的计数。
# set modal values as a named numeric vector
modals <- c(5, 4, 4, 3, 2, 1)
names(modals) <- c("must", "will", "shall", "should", "may", "can")
library("quanteda", warn.conflicts = FALSE)
## Package version: 1.4.0
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
我将在这里使用最近的就职演讲作为可重复的示例。
dfmat <- data_corpus_inaugural %>%
corpus_subset(Year > 2000) %>%
dfm() %>%
dfm_select(pattern = names(modals))
这会产生原始计数。
dfmat
## Document-feature matrix of: 5 documents, 6 features (26.7% sparse).
## 5 x 6 sparse Matrix of class "dfm"
## features
## docs will must can should may shall
## 2001-Bush 23 6 6 1 0 0
## 2005-Bush 22 6 7 1 3 0
## 2009-Obama 19 8 13 0 3 3
## 2013-Obama 20 17 7 0 4 0
## 2017-Trump 40 3 1 1 0 0
现在对此进行加权就像调用dfm_weight()
通过权重向量的值重新加权计数一样简单。该函数将使用向量元素名称的固定匹配自动将权重应用于 dfm 特征。
dfm_weight(dfmat, weight = modals)
## Document-feature matrix of: 5 documents, 6 features (26.7% sparse).
## 5 x 6 sparse Matrix of class "dfm"
## features
## docs will must can should may shall
## 2001-Bush 92 30 6 3 0 0
## 2005-Bush 88 30 7 3 6 0
## 2009-Obama 76 40 13 0 6 12
## 2013-Obama 80 85 7 0 8 0
## 2017-Trump 160 15 1 3 0 0