r - Removing stop words from a character vector in R using %in%

Question

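I have a data frame with strings from which I want to remove stop words. I am trying to avoid the tm package for the removal itself, because the dataset is large and tm seems to run a bit slowly. I am using the tm stopword dictionary.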

library(plyr)
library(tm)

stopWords <- stopwords("en")
class(stopWords)

df1 <- data.frame(id = seq(1,5,1), string1 = NA)
head(df1)
df1$string1[1] <- "This string is a string."
df1$string1[2] <- "This string is a slightly longer string."
df1$string1[3] <- "This string is an even longer string."
df1$string1[4] <- "This string is a slightly shorter string."
df1$string1[5] <- "This string is the longest string of all the other strings."

head(df1)
df1$string1 <- tolower(df1$string1)
str1 <-  strsplit(df1$string1[5], " ")

> !(str1 %in% stopWords)
[1] TRUE

This is not the answer I am looking for. I am trying to get a vector or string of the words that are NOT in the stopWords vector.

What am I doing wrong?

score 15 · Accepted Answer

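You are not accessing the list properly (strsplit returns a list), and you are not using the result of %in% (a logical vector of TRUE/FALSE) to get the elements back. You should do something like this: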

unlist(str1)[!(unlist(str1) %in% stopWords)]

(or)

str1[[1]][!(str1[[1]] %in% stopWords)]
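To illustrate why the original call returned a single TRUE, here is a minimal sketch. It assumes a small hand-written `stop_words` vector as a stand-in for `tm::stopwords("en")`, so that it runs without tm; the names `stop_words`, `words`, and `keep` are illustrative only.

```r
# Hypothetical small stop-word vector, standing in for tm::stopwords("en")
stop_words <- c("this", "is", "the", "of", "all", "other")

str1 <- strsplit("this string is the longest string of all the other strings.", " ")

class(str1)    # "list": strsplit() returns one list element per input string
length(str1)   # 1

# The list has a single element, so %in% yields a single logical value
!(str1 %in% stop_words)

# Index into the list first, then subset the words with the logical vector
words <- str1[[1]]
keep <- !(words %in% stop_words)
words[keep]
#> [1] "string"   "longest"  "string"   "strings."
```

The logical vector from `%in%` only marks which words to keep; the subsetting step `words[keep]` is what returns the words themselves.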

For the whole data.frame df1, you could do something like this:

'%nin%' <- Negate('%in%')
lapply(df1[,2], function(x) {
    t <- unlist(strsplit(x, " "))
    t[t %nin% stopWords]
})

# [[1]]
# [1] "string"  "string."
# 
# [[2]]
# [1] "string"   "slightly" "string." 
# 
# [[3]]
# [1] "string"  "string."
# 
# [[4]]
# [1] "string"   "slightly" "shorter"  "string." 
# 
# [[5]]
# [1] "string"   "string"   "strings."

score 6 · Accepted Answer

First, you should unlist str1, or use lapply if str1 is a vector (here words is the stop-word vector):

!(unlist(str1) %in% words)
#>  [1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE
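The lapply variant mentioned above could look roughly like the following sketch when several strings are split at once. The `stop_words` vector and the input strings are made up for illustration and are not the tm dictionary.

```r
# Hypothetical small stop-word vector, standing in for tm::stopwords("en")
stop_words <- c("this", "is", "a", "the")

strs <- c("this string is a string.", "this is the other string.")

# strsplit() on a vector of strings gives a list with one element per string
split_strs <- strsplit(strs, " ", fixed = TRUE)

# Apply the %in% test to each list element separately
flags <- lapply(split_strs, function(w) !(w %in% stop_words))
flags
#> [[1]]
#> [1] FALSE  TRUE FALSE FALSE  TRUE
#>
#> [[2]]
#> [1] FALSE FALSE FALSE  TRUE  TRUE
```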

Second, a more complex solution:

string <- c("This string is a string.",
            "This string is a slightly longer string.",
            "This string is an even longer string.",
            "This string is a slightly shorter string.",
            "This string is the longest string of all the other strings.")
rm_words <- function(string, words) {
    stopifnot(is.character(string), is.character(words))
    spltted <- strsplit(string, " ", fixed = TRUE) # fixed = TRUE for speedup
    vapply(spltted, function(x) paste(x[!tolower(x) %in% words], collapse = " "), character(1))
}
rm_words(string, tm::stopwords("en"))
#> [1] "string string."                  "string slightly longer string."  "string even longer string."     
#> [4] "string slightly shorter string." "string longest string strings."

score 0 · Accepted Answer

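I came across this question while working on something similar.

Although this has already been answered, I wanted to share a concise line of code that I used for my own problem. It removes all the stop words directly in your data frame: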

df1$string1 <- unlist(lapply(df1$string1, function(x) {paste(unlist(strsplit(x, " "))[!(unlist(strsplit(x, " ")) %in% stopWords)], collapse=" ")}))
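The one-liner above splits each string twice. A possible variant that splits only once is sketched below; `drop_stop_words` is a hypothetical helper name, and the short `stop_words` vector is an assumption standing in for `tm::stopwords("en")`.

```r
# Hypothetical small stop-word vector, standing in for tm::stopwords("en")
stop_words <- c("this", "is", "a", "an", "the", "of", "all", "other")

df1 <- data.frame(
  id = 1:2,
  string1 = c("this string is a string.",
              "this string is a slightly longer string."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split one string once, drop stop words, re-join
drop_stop_words <- function(x, stop_words) {
  words <- strsplit(x, " ", fixed = TRUE)[[1]]
  paste(words[!(words %in% stop_words)], collapse = " ")
}

# vapply() checks that every result is a single string
df1$string1 <- vapply(df1$string1, drop_stop_words, character(1),
                      stop_words = stop_words, USE.NAMES = FALSE)
df1$string1
#> [1] "string string."                 "string slightly longer string."
```

Using `vapply` with `character(1)` instead of `unlist(lapply(...))` makes the call fail loudly if any element does not come back as exactly one string.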
