inner-join - Applying a custom (weighted) dictionary to text based on sentiment analysis

Question

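I am looking to adjust this code so that I can assign a different weight to each of these modal verbs. The idea is to use something similar to the NRC lexicon, except that here the "numbers" 1-5 represent categories rather than numeric values.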

modals<-data_frame(word=c("must", "will", "shall", "should", "may", "can"), 
modal=c("5", "4", "4", "3", "2", "1"))

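My problem is that when I run the following code, five occurrences of "may" count the same as one "must". What I want is for each word to carry a different weight, so that when I run this analysis I can see where the stronger "must" is concentrated compared with, say, the weaker "can". *"tidy.DF" is my corpus, and "School" and "Target" are column names.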

MODAL<-tidy.DF %>%
  inner_join(modals) %>%
  count(School, Target, modal, index=wordnumber %/% 50, modal) %>%
  spread(modal, n, fill=0)

ggplot(MODAL, aes(index, 5, fill=Target)) +
  geom_col(show.legend=FALSE) +
  facet_wrap(~Target, ncol=2, scales="free_x")

score 0 · Accepted Answer

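Here is a suggestion for a better approach, using the quanteda package instead. The approach:

Create a named vector of weights, corresponding to your "dictionary".
Create a document-feature matrix, selecting only the terms in the dictionary.
Weight the observed counts.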
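To make the three steps concrete before bringing in quanteda, here is a minimal base-R sketch of the same idea. The two toy documents and the helper `count_terms()` are hypothetical and exist only for illustration; they are not part of the answer's method.

```r
# Step 1: a named vector of weights (same values as in the answer)
modals <- c(must = 5, will = 4, shall = 4, should = 3, may = 2, can = 1)

# Hypothetical toy documents, for illustration only
texts <- c(doc1 = "We must act and we will act",
           doc2 = "You may go or you can stay")

# Step 2: count only the dictionary terms in each document
tokens <- strsplit(tolower(texts), "[^a-z]+")
count_terms <- function(tok) {
  idx <- match(tok, names(modals))   # position of each token in the dictionary
  tabulate(idx[!is.na(idx)], nbins = length(modals))
}
counts <- t(sapply(tokens, count_terms))
colnames(counts) <- names(modals)

# Step 3: multiply each column by the weight whose name matches it
weighted <- sweep(counts, 2, modals[colnames(counts)], `*`)
weighted
```

Matching the weights by column name, rather than by position, is what keeps the result correct even when the features appear in a different order from the weight vector.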

# set modal values as a named numeric vector
modals <- c(5, 4, 4, 3, 2, 1)
names(modals) <- c("must", "will", "shall", "should", "may", "can")

library("quanteda", warn.conflicts = FALSE)
## Package version: 1.4.0
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

I'll use the most recent inaugural speeches as a reproducible example here.

dfmat <- data_corpus_inaugural %>%
  corpus_subset(Year > 2000) %>%
  dfm() %>%
  dfm_select(pattern = names(modals))

This produces the raw counts.

dfmat
## Document-feature matrix of: 5 documents, 6 features (26.7% sparse).
## 5 x 6 sparse Matrix of class "dfm"
##             features
## docs         will must can should may shall
##   2001-Bush    23    6   6      1   0     0
##   2005-Bush    22    6   7      1   3     0
##   2009-Obama   19    8  13      0   3     3
##   2013-Obama   20   17   7      0   4     0
##   2017-Trump   40    3   1      1   0     0
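The output above comes from quanteda 1.4.0. To the best of my knowledge, quanteda 3 and later no longer accept a corpus passed directly to `dfm()` and expect an explicit `tokens()` step first. A hedged sketch of the same selection for a newer installation follows; it assumes a recent quanteda is installed, and the counts will differ from those shown because the bundled corpus has since gained more speeches.

```r
library("quanteda")

# Same named weight vector as in the answer
modals <- c(must = 5, will = 4, shall = 4, should = 3, may = 2, can = 1)

# Subset the corpus, tokenize explicitly, then build the dfm
corp  <- corpus_subset(data_corpus_inaugural, Year > 2000)
toks  <- tokens(corp)
dfmat <- dfm_select(dfm(toks), pattern = names(modals))

dfmat
```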

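Weighting this is now as simple as calling dfm_weight() to reweight the counts by the values of your weight vector. The function automatically applies the weights to the dfm features, using fixed matching on the names of the vector elements.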

dfm_weight(dfmat, weights = modals)
## Document-feature matrix of: 5 documents, 6 features (26.7% sparse).
## 5 x 6 sparse Matrix of class "dfm"
##             features
## docs         will must can should may shall
##   2001-Bush    92   30   6      3   0     0
##   2005-Bush    88   30   7      3   6     0
##   2009-Obama   76   40  13      0   6    12
##   2013-Obama   80   85   7      0   8     0
##   2017-Trump  160   15   1      3   0     0
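If you would rather stay with the tidy pipeline from the question, the same effect can probably be had by storing the weights as numbers and summing them instead of counting rows. The sketch below assumes dplyr and tibble are installed; `tidy_df` is a hypothetical stand-in for the asker's `tidy.DF`, and the column names `weight` and `score` are my own.

```r
library(dplyr)
library(tibble)

# Weights stored as numbers, not character strings
modals <- tibble(
  word   = c("must", "will", "shall", "should", "may", "can"),
  weight = c(5, 4, 4, 3, 2, 1)
)

# Hypothetical stand-in for tidy.DF: one row per token
tidy_df <- tibble(
  School     = "A",
  Target     = c("x", "x", "x", "y", "y"),
  wordnumber = c(1, 2, 60, 3, 4),
  word       = c("must", "may", "can", "the", "will")
)

# Keep only modal verbs, then sum the weights within each 50-word chunk
weighted <- tidy_df %>%
  inner_join(modals, by = "word") %>%
  count(School, Target, index = wordnumber %/% 50,
        wt = weight, name = "score")

weighted
```

Using `wt = weight` makes `count()` add up the weights, so one "must" contributes 5 while one "can" contributes 1, which is the behaviour the question asks for.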
