r - Find the most frequent term in each document of a corpus

Question

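I have been using R's tm package with great success on classification problems. I know how to find the most frequent terms across the whole corpus (using findFreqTerms()), but I can't see anything in the documentation that finds the most frequent term in each individual document in the corpus (after I have stemmed and removed stopwords, but before I remove sparse terms). I have tried using apply() with max, but that gives me the maximum number of times a term appears in each document, not the name of the term itself.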

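library(tm)

data("crude")
# Clean the text: punctuation, whitespace, case, stopwords, then stem.
corpus <- tm_map(crude, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
# tm >= 0.6 needs content_transformer() around base functions such as tolower.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)
dtm <- DocumentTermMatrix(corpus)
# Returns the highest count per document, not the term that has it.
maxterms <- apply(dtm, 1, max)
maxterms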
127 144 191 194 211 236 237 242 246 248 273 349 352 
 5  13   2   3   3  10   8   3   7   9   9   4   5 
353 368 489 502 543 704 708 
 4   4   4   5   5   9   4

Any ideas?

score 4 · Accepted Answer

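Ben's answer gives you what you asked for, but I'm not sure that what you asked for is wise: it does not account for ties. Here is one approach, plus a second approach using the qdap package. Both give you lists with the words (in qdap's case, a list of data frames with the word and its frequency). To get the rest of the way, you can use unlist with the first option, and lapply, indexing, and unlist with qdap. The qdap approach works on the raw Corpus:

Option #1: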

apply(dtm, 1, function(x) unlist(dtm[["dimnames"]][2], 
    use.names = FALSE)[x == max(x)])
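
To see how Option #1 treats ties, here is a minimal sketch on a hand-made count matrix rather than the real dtm. The matrix `m` and its document and term names are invented for illustration. It uses lapply() over the rows instead of apply(), because apply() only returns a list when the per-row results differ in length and otherwise simplifies to a vector or matrix.

```r
# Toy document-term counts (made-up data, not from the crude corpus).
m <- matrix(
  c(3, 3, 0,
    0, 1, 5,
    2, 2, 2),
  nrow = 3, byrow = TRUE,
  dimnames = list(Docs = c("d1", "d2", "d3"),
                  Terms = c("oil", "price", "opec"))
)

# Keep every term whose count equals the row maximum, so ties survive.
top_terms <- lapply(rownames(m), function(d) {
  x <- m[d, ]                # named vector of counts for one document
  names(x)[x == max(x)]
})
names(top_terms) <- rownames(m)

# Collapse each document's tied terms into one string if a flat
# character vector is handier than a list.
top_flat <- vapply(top_terms, paste, character(1), collapse = ", ")
```

Here `top_terms` is a list with one character vector per document, and `top_flat` is a named character vector with tied terms joined by commas.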

Option #2 with qdap:

library(qdap)
dat <- tm_corpus2df(crude)
tapply(stemmer(dat$text), dat$docs, freq_terms, top = 1, 
    stopwords = tm::stopwords("english"))

Wrapping the tapply call with lapply(WRAP_HERE, "[", 1) makes the two answers nearly identical in content and format.
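
The lapply(WRAP_HERE, "[", 1) idiom simply pulls the first column out of every element of a list. A rough sketch, with hand-made data frames standing in for the qdap result (the column names WORD and FREQ and the values are assumptions for illustration, not captured freq_terms output):

```r
# Stand-in for the list of per-document frequency tables (made-up values).
per_doc <- list(
  "127" = data.frame(WORD = c("oil", "price"), FREQ = c(5, 5),
                     stringsAsFactors = FALSE),
  "144" = data.frame(WORD = "opec", FREQ = 13, stringsAsFactors = FALSE)
)

# "[" with a single index selects columns, so each element stays a
# one-column data frame holding only the words.
words_df <- lapply(per_doc, "[", 1)

# "[[" extracts the column itself, giving plain character vectors.
words_chr <- lapply(per_doc, "[[", 1)
```

Using "[[" instead of "[" is one way to end up with plain character vectors, which is closer to the shape Option #1 returns.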

EDIT: Added an example that is a leaner use of qdap:

FUN <- function(x) freq_terms(x, top = 1, stopwords = stopwords("english"))[, 1]
lapply(stemmer(crude), FUN)

## [[1]]
## [1] "oil"   "price"
## 
## [[2]]
## [1] "opec"
## 
## [[3]]
## [1] "canada"   "canadian" "crude"    "oil"      "post"     "price"    "texaco"  
## 
## [[4]]
## [1] "crude"
## 
## [[5]]
## [1] "estim"  "reserv" "said"   "trust" 
## 
## [[6]]
## [1] "kuwait" "said"  
## 
## [[7]]
## [1] "report" "say"   
## 
## [[8]]
## [1] "yesterday"
## 
## [[9]]
## [1] "billion"
## 
## [[10]]
## [1] "market" "price" 
## 
## [[11]]
## [1] "mln"
## 
## [[12]]
## [1] "oil"
## 
## [[13]]
## [1] "oil"   "price"
## 
## [[14]]
## [1] "oil"  "opec"
## 
## [[15]]
## [1] "power"
## 
## [[16]]
## [1] "oil"
## 
## [[17]]
## [1] "oil"
## 
## [[18]]
## [1] "dlrs"
## 
## [[19]]
## [1] "futur"
## 
## [[20]]
## [1] "januari"

score 2 · Accepted Answer

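You're almost there: replace max with which.max to get the column index of the term with the highest frequency in each document (i.e. each row). Then use that vector of column indices to subset the terms (in effect, the column names) of the document-term matrix. This returns the actual term that has the maximum frequency in each document, rather than just the frequency value that max gives you. So, following your example: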

maxterms<-apply(dtm, 1, which.max)
dtm$dimnames$Terms[maxterms]
[1] "oil"     "opec"    "canada"  "crude"   "said"    "said"    "report"  "oil"    
 [9] "billion" "oil"     "mln"     "oil"     "oil"     "oil"     "power"   "oil"    
[17] "oil"     "dlrs"    "futures" "january"
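
One thing worth checking before relying on this: which.max() returns only the first position of the maximum, so a document with tied top terms reports just one of them. A small sketch on an invented count matrix (the names `m`, `d1` and so on are illustrative), which also keeps the count next to the term:

```r
# Toy document-term counts (made-up data); d1 has a tie between oil and price.
m <- matrix(
  c(3, 3, 0,
    0, 1, 5),
  nrow = 2, byrow = TRUE,
  dimnames = list(Docs = c("d1", "d2"),
                  Terms = c("oil", "price", "opec"))
)

# Column index of the first maximum in each row.
idx <- apply(m, 1, which.max)

# Look up the term names and pair them with their counts.
top <- data.frame(
  doc  = rownames(m),
  term = colnames(m)[idx],
  freq = m[cbind(seq_len(nrow(m)), idx)],  # matrix indexing: one (row, col) pair per document
  stringsAsFactors = FALSE
)
```

For d1 only the first of the two tied terms is kept, which is the behaviour the other answer warns about.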
