@Tyler Rinker has already given the answer: just add another line of removeWords().
Here is a bit more detail.
Let's say your Excel file is called nuts.xls
and has a single column of words like this:
stopwords
peanut
cashew
walnut
almond
macadamia
In R
you might proceed like this:
library(gdata) # package with xls import function
library(tm)
# now load the excel file with the custom stoplist. Note two of the arguments here:
# strip.white=TRUE removes the stray spaces that excel seems to insert, and
# stringsAsFactors=FALSE stops the words being imported as factors.
# You can use any args from read.table(), which is handy
nuts <- read.xls("nuts.xls", header=TRUE, stringsAsFactors=FALSE, strip.white=TRUE)
# now make some words to build a corpus to test for a two-step stopword removal process...
words1<- c("peanut, cashew, walnut, macadamia, apple, pear, orange, lime, mandarin, and, or, but")
words2<- c("peanut, cashew, walnut, almond, apple, pear, orange, lime, mandarin, if, then, on")
words3<- c("peanut, walnut, almond, macadamia, apple, pear, orange, lime, mandarin, it, as, an")
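words.all <- data.frame(rbind(words1, words2, words3))
# note: this was written against an older tm; recent tm versions expect the data frame
# passed to DataframeSource() to have doc_id and text columns
words.corpus <- Corpus(DataframeSource(words.all))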
# now remove the standard list of stopwords, like you've already worked out
words.corpus.nostopwords <- tm_map(words.corpus, removeWords, stopwords("english"))
# now remove the second set of stopwords, this time your custom set from the excel file,
# note that it has to be a reference to a character vector containing the custom stopwords
words.corpus.nostopwords <- tm_map(words.corpus.nostopwords, removeWords, nuts$stopwords)
# have a look to see if it worked
inspect(words.corpus.nostopwords)
A corpus with 3 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
$words1
, , , , apple, pear, orange, lime, mandarin, , ,
$words2
, , , , apple, pear, orange, lime, mandarin, , ,
$words3
, , , , apple, pear, orange, lime, mandarin, , ,
Success! The standard stopwords are gone, and so are the words from the custom list in the Excel file. No doubt there are other ways to do this.
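One such alternative is to merge the two stoplists and remove them in a single pass. The following is a minimal base-R sketch of that idea, not tm's own implementation: strip_words() is a hypothetical helper, standard_stopwords is a small hand-picked stand-in for stopwords("english"), and custom_stopwords stands in for the nuts$stopwords column, so it runs without tm, gdata, or the Excel file.

```r
# Stand-in for nuts$stopwords (the custom list from the Excel file)
custom_stopwords <- c("peanut", "cashew", "walnut", "almond", "macadamia")

# Small stand-in for stopwords("english"); the real list is much longer
standard_stopwords <- c("and", "or", "but", "if", "then", "on", "it", "as", "an")

# Hypothetical helper: delete whole-word matches of any entry in `words`
strip_words <- function(x, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  gsub(pattern, "", x, perl = TRUE)
}

# The same three test documents as above, as a named character vector
docs <- c(
  words1 = "peanut, cashew, walnut, macadamia, apple, pear, orange, lime, mandarin, and, or, but",
  words2 = "peanut, cashew, walnut, almond, apple, pear, orange, lime, mandarin, if, then, on",
  words3 = "peanut, walnut, almond, macadamia, apple, pear, orange, lime, mandarin, it, as, an"
)

# One pass over the combined stoplist instead of two separate passes
cleaned <- strip_words(docs, c(standard_stopwords, custom_stopwords))
print(cleaned)
```

With tm itself, the equivalent would presumably be to pass c(stopwords("english"), nuts$stopwords) as the word list in a single tm_map(..., removeWords, ...) call, since removeWords() just takes a character vector of words.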