r - Is it possible to supply a custom stop word list to the RTextTools package?

Question

With the tm package, I can do this:

c0 <- Corpus(VectorSource(text))
c0 <- tm_map(c0, removeWords, c(stopwords("english"),mystopwords))

where mystopwords is a vector of the additional stop words I want to remove.

But I cannot find an equivalent way to do this with the RTextTools package. For example:

dtm <- create_matrix(text,language="english",
             removePunctuation=T,
             stripWhitespace=T,
             toLower=T,
             removeStopwords=T, #no clear way to specify a custom list here!
             stemWords=T)

Is it possible to do this? I really like the RTextTools interface, and it would be a pity to have to go back to tm.

score 3 · Accepted Answer

There are three (or more) solutions to your problem:

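First, use the tm package only for removing the words. Both packages work with the same objects, so you can remove the words with tm and then continue with RTextTools. If you look inside create_matrix, you will see that it uses tm functions itself.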
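A minimal sketch of this first option, assuming the text is a plain character vector: tm's removeWords() also accepts character vectors, so the custom words can be stripped before the text reaches create_matrix. The sample text and the contents of mystopwords are made up for illustration, and the RTextTools call is guarded so the snippet still runs where that package is not installed.

```r
library(tm)  # provides removeWords()

# Hypothetical sample data and custom stop words
text <- c("The quick brown fox jumps over the lazy dog",
          "A lazy afternoon with the quick brown cat")
mystopwords <- c("quick", "lazy")

# removeWords() is case-sensitive, so lower-case the text first
cleaned <- removeWords(tolower(text), mystopwords)

# Hand the pre-cleaned text to RTextTools; the built-in English
# stop words are still removed by removeStopwords = TRUE
if (requireNamespace("RTextTools", quietly = TRUE)) {
  dtm <- RTextTools::create_matrix(cleaned, language = "english",
                                   removePunctuation = TRUE,
                                   stripWhitespace = TRUE,
                                   toLower = TRUE,
                                   removeStopwords = TRUE)
  print(colnames(dtm))
}
```

Because the words are removed before any stemming, the custom list should contain the unstemmed forms.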

Second, modify the create_matrix function. For example, add an input parameter such as own_stopwords=NULL and add the following lines:

# existing line
corpus <- Corpus(VectorSource(trainingColumn), 
                     readerControl = list(language = language))
# after that add this new line
if(!is.null(own_stopwords)) corpus <- tm_map(corpus, removeWords, 
                                          words=as.character(own_stopwords))
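The same idea can be sketched without editing the package source. The function below is not RTextTools' create_matrix; create_matrix_custom is a hypothetical, much simplified stand-in built directly on tm, shown only to illustrate where an own_stopwords argument would slot into such a pipeline.

```r
library(tm)

# Hypothetical simplified stand-in for create_matrix with an extra
# own_stopwords argument (not the RTextTools implementation)
create_matrix_custom <- function(text, own_stopwords = NULL,
                                 language = "english") {
  corpus <- VCorpus(VectorSource(text),
                    readerControl = list(language = language))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeWords, stopwords(language))
  # the added step: remove the user-supplied words, if any
  if (!is.null(own_stopwords)) {
    corpus <- tm_map(corpus, removeWords,
                     words = as.character(own_stopwords))
  }
  corpus <- tm_map(corpus, stripWhitespace)
  DocumentTermMatrix(corpus)
}

# Made-up example input
text <- c("The quick brown fox", "The lazy dog sleeps")
dtm <- create_matrix_custom(text, own_stopwords = c("fox", "dog"))
Terms(dtm)
```

Note that this sketch returns a plain tm DocumentTermMatrix and skips the other options create_matrix offers (stemming, n-grams, weighting and so on).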

Third, write your own function, something like this:

# excluder function: drop the columns of dtm whose term is in own_stw
remove_my_stopwords <- function(own_stw, dtm){
  # positions of the terms to keep; positive indices avoid dropping
  # every column when none of the stop words occur in the matrix
  keep <- which(!(colnames(dtm) %in% own_stw))
  return(dtm[, keep])
}

Let's see if it works:

# let's test it
data(NYTimes)
data <- NYTimes[sample(1:3100, size=10,replace=FALSE),]
matrix <- create_matrix(cbind(data["Title"], data["Subject"]))

head(colnames(matrix), 5)
# [1] "109"         "200th"       "abc"         "amid"        "anniversary"


# let's consider the words above as our "own" stop words
ostw <- head(colnames(matrix), 5)

matrix2<-remove_my_stopwords(own_stw=ostw, dtm=matrix)

# check if they are still there
sapply(ostw, function(x, words) any(x==words), words=colnames(matrix2))
#109       200th         abc        amid anniversary 
#FALSE       FALSE       FALSE       FALSE       FALSE
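The column filtering can also be checked on a plain base-R matrix, including the edge case where none of the supplied words occur. This is only a sketch: drop_terms and the toy matrix are hypothetical names and data, and it assumes the terms are stored as column names, as in a document-term matrix.

```r
# Hypothetical helper: keep only the columns whose name is not a stop word
drop_terms <- function(m, own_stw) {
  keep <- which(!(colnames(m) %in% own_stw))
  m[, keep, drop = FALSE]
}

# Toy 2 x 3 "document-term" matrix
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("doc1", "doc2"), c("abc", "amid", "news")))

colnames(drop_terms(m, c("abc", "amid")))  # "news"
colnames(drop_terms(m, "zzz"))             # "abc" "amid" "news"
```

Selecting the columns to keep, rather than negating the indices of the columns to remove, means an empty match leaves the matrix unchanged instead of emptying it.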

HTH

score 0 · Answer

You can add your own stop words to the same list. For example:

c0 <- tm_map(c0, removeWords, c(stopwords("english"),"mystopwords"))
