我不是 100% 你所追求的,也不完全了解它是如何tm_map
工作的。如果我理解,那么以下工作。据我了解,您想提供一个不应被阻止的单词列表。我使用 qdap 包主要是因为我很懒,而且它有mgsub
我喜欢的功能。
请注意,我对使用感到沮丧,mgsub
因为tm_map
它不断抛出错误,所以我只使用lapply
了。
texts <- c("i am member of the XYZ association",
"apply for our open associate position",
"xyz memorial lecture takes place on wednesday",
"vote for the most popular lecturer")
library(tm)
# Step 1: Create corpus
corpus.copy <- corpus <- Corpus(DataframeSource(data.frame(texts)))
library(qdap)
# Step 2: list to retain and indentifier keys
retain <- c("lecturer", "lecture")
replace <- paste(seq_len(length(retain)), "SPECIAL_WORD", sep="_")
# Step 3: sub the words you want to retain with identifier keys
corpus[seq_len(length(corpus))] <- lapply(corpus, mgsub, pattern=retain, replacement=replace)
# Step 4: Stem it
corpus.temp <- tm_map(corpus, stemDocument, language = "english")
# Step 5: reverse -> sub the identifier keys with the words you want to retain
corpus.temp[seq_len(length(corpus.temp))] <- lapply(corpus.temp, mgsub, pattern=replace, replacement=retain)
inspect(corpus) #inspect the pieces for the folks playing along at home
inspect(corpus.copy)
inspect(corpus.temp)
# Step 6: complete the stem
corpus.final <- tm_map(corpus.temp, stemCompletion, dictionary = corpus.copy)
inspect(corpus.final)
基本上它的工作原理是:
- 为提供的“NO STEM”词替换唯一标识符键(the
mgsub
)
- 然后你干(使用
stemDocument
)
- 接下来,您将其反转并使用“NO STEM”字样(的
mgsub
)子标识符键
- 最后完成词干 (
stemCompletion
)
这是输出:
## > inspect(corpus.final)
## A corpus with 4 text documents
##
## The metadata consists of 2 tag-value pairs and a data frame
## Available tags are:
## create_date creator
## Available variables in the data frame are:
## MetaID
##
## $`1`
## i am member of the XYZ associate
##
## $`2`
## for our open associate position
##
## $`3`
## xyz memorial lecture takes place on wednesday
##
## $`4`
## vote for the most popular lecturer