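I have two corpora (which I turn into DocumentTermMatrices, then data frames, then wordclouds), one of which is a subset of the other. Specifically, one is a corpus of text about a single university, and the other is a corpus of text about all the universities in that conference.

Is there a way in R to extract only the words that are unique to the smaller set? This is what I have been running for each corpus so far (this one is for the "conference" corpus):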

SECDraft = read.csv("SECDraftScouting.csv", stringsAsFactors=FALSE)
SECcorpus = Corpus(VectorSource(SECDraft$Report))
SECcorpus = tm_map(SECcorpus, tolower)
SECcorpus = tm_map(SECcorpus, PlainTextDocument)
SECcorpus = tm_map(SECcorpus, removePunctuation)
SECcorpus = tm_map(SECcorpus, removeWords, c("strengths", "weaknesses", "notes", stopwords("english")))
SECfrequencies = DocumentTermMatrix(SECcorpus)
SECallReports = as.data.frame(as.matrix(SECfrequencies))
wordcloud(colnames(SECallReports), colSums(SECallReports), random.order = FALSE, max.words = 200, scale=c(2, 0.25))

Thanks, everyone!

1 Answer

As in my reply to your other post, I would do this in the quanteda package. I can't test this because I don't have your .csv files, but it should work:

# install.packages("quanteda")
require(quanteda)

# read in each corpus separately, directly into quanteda
# the text column is named "Report" (capitalized) in the question's data
mycorpus1 <- corpus(textfile("UKDraftScouting.csv", textField = "Report"))
mycorpus2 <- corpus(textfile("SECMinusUKDraftScouting.csv", textField = "Report"))
# assign docset variables to each corpus as appropriate 
docvars(mycorpus1, "docset") <- 1 
docvars(mycorpus2, "docset") <- 2
myCombinedCorpus <- mycorpus1 + mycorpus2

myDfm <- dfm(myCombinedCorpus, 
             groups = "docset", # by docset instead of document
             ignoredFeatures = c("strengths", "weaknesses", "notes", stopwords("english")),
             matrixType = "dense")

# create a logical vector indexing the features unique to corpus 1
uniqueToCorpus1 <- (myDfm[1, ] & !myDfm[2, ])
# this is the dfm with features unique to dfm1
myDfm[1, uniqueToCorpus1]
# list the word features as a character vector
features(myDfm[1, uniqueToCorpus1])
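
Since the quanteda code above is untested, here is a minimal base R sketch of the same selection logic (terms present in set 1 and absent from set 2). The matrix `counts`, its row names, and the terms in it are made up for illustration; they are not taken from the scouting data.

```r
# Toy term-count matrix: one row per document set, one column per term.
# All counts and terms here are hypothetical, for illustration only.
counts <- matrix(
  c(3, 0, 2, 1,
    1, 4, 0, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("docset1", "docset2"),
                  c("speed", "blocking", "wildcat", "kentucky"))
)

# TRUE where a term occurs in docset1 and never occurs in docset2
uniqueToSet1 <- counts["docset1", ] > 0 & counts["docset2", ] == 0

# Counts for the terms unique to docset1, named by term
uniqueCounts <- counts["docset1", uniqueToSet1]
names(uniqueCounts)
```

With real data, a named count vector of this shape could presumably be passed to the `wordcloud()` call from the question as `wordcloud(names(uniqueCounts), uniqueCounts)`.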
answered 2015-06-01T15:52:50.897