r - Match a list of phrases against a corpus of documents and return phrase frequencies

Question

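I have a list of phrases and a corpus of documents. There are 100k+ phrases and 60k+ documents in the corpus. The phrases may or may not be present in the corpus. I am looking to find the term frequency of each phrase in the corpus.

An example dataset: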

Phrases <- c("just starting", "several kilometers", "brief stroll", "gradually boost", "5 miles", "dark night", "cold morning")
Doc1 <- "If you're just starting with workout, begin slow."
Doc2 <- "Don't jump in brain initial and then try to operate several kilometers without the need of worked out well before."
Doc3 <- "It is possible to end up injuring on your own and carrying out more damage than good."
Doc4 <- "Instead start with a brief stroll and gradually boost the duration along with the speed."
Doc5 <- "Before you know it you'll be working 5 miles without any problems."

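I am new to text analysis in R and have followed Tyler Rinker's solution to the question "R Text Mining: Counting the number of times a specific word appears in a corpus?".

Here is my approach so far: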

library(tm)
library(qdap)
Docs <- c(Doc1, Doc2, Doc3, Doc4, Doc5)
text <- removeWords(Docs, stopwords("english"))
text <- removePunctuation(text)
text <- tolower(text)
corp <- Corpus(VectorSource(text))
Phrases <- tolower(Phrases)
word.freq <- apply_as_df(corp, termco_d, match.string=Phrases)
mcsv_w(word.freq, dir = NULL, open = T, sep = ", ", dataframes = NULL,
        pos = 1, envir = as.environment(pos))

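When I export the results to a csv, it only tells me whether phrase 1 is present in any of the documents.

I am expecting output like the following (excluding the phrases that do not match):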

Docs      Phrase1     Phrase2    Phrase3    Phrase4    Phrase5
1         0           1          2          0          0
2         1           0          0          1          0

score 0 · Accepted Answer

I tried your approach but could not reproduce the problem.

Using:

library(tm)
library(qdap)
Docs <- c(Doc1, Doc2, Doc3, Doc4, Doc5)
text <- removeWords(Docs, stopwords("english"))
text <- removePunctuation(text)
text <- tolower(text)
corp <- Corpus(VectorSource(text))
Phrases <- tolower(Phrases)
word.freq <- apply_as_df(corp, termco_d, match.string = Phrases)
mcsv_w(word.freq, dir = NULL, open = T, sep = ", ", dataframes = NULL,
        pos = 1, envir = as.environment(pos))

I get the following csv:

docs  word.count  term(just starting)  term(several kilometers)  term(brief stroll)  term(gradually boost)  term(5 miles)  term(dark night)  term(cold morning)
1     7           1                    0                         0                   0                      0              0                 0
2     12          0                    1                         0                   0                      0              0                 0
3     7           0                    0                         0                   0                      0              0                 0
4     9           0                    0                         1                   1                      0              0                 0
5     7           0                    0                         0                   0                      0              0                 0
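
If the goal is the document-by-phrase count table from the question with the unmatched phrases dropped, a minimal base R sketch might look like the one below. It is not the qdap approach above: the helper name count_phrase is made up for illustration, the matching is plain literal substring matching on lowercased text, and no stopword or punctuation removal is done first. Those are assumptions, so the counts can differ from what termco_d reports.

```r
# Sample data copied from the question
Phrases <- c("just starting", "several kilometers", "brief stroll",
             "gradually boost", "5 miles", "dark night", "cold morning")
Docs <- c(
  "If you're just starting with workout, begin slow.",
  "Don't jump in brain initial and then try to operate several kilometers without the need of worked out well before.",
  "It is possible to end up injuring on your own and carrying out more damage than good.",
  "Instead start with a brief stroll and gradually boost the duration along with the speed.",
  "Before you know it you'll be working 5 miles without any problems."
)

# Hypothetical helper: count non-overlapping literal occurrences of one
# phrase in each document. gregexpr() returns -1 when nothing matches,
# so only positive match positions are counted.
# Note: this is substring matching, so "5 miles" would also hit "15 miles".
count_phrase <- function(phrase, docs) {
  hits <- gregexpr(phrase, docs, fixed = TRUE)
  vapply(hits, function(m) sum(m > 0), integer(1))
}

docs_lower    <- tolower(Docs)
phrases_lower <- tolower(Phrases)

# Rows are documents, columns are phrases
counts <- vapply(phrases_lower, count_phrase,
                 integer(length(docs_lower)), docs = docs_lower)
rownames(counts) <- paste0("Doc", seq_along(Docs))

# Keep only phrases that occur at least once somewhere in the corpus
matched <- counts[, colSums(counts) > 0, drop = FALSE]
print(matched)
```

The matched matrix can be written out with write.csv(matched, "phrase_counts.csv"), where the file name is only a placeholder. With 100k+ phrases and 60k+ documents this loop makes one pass over the corpus per phrase, so it is meant to show the shape of the result rather than to be a tuned solution.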
