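Removing word endings like this is called stemming, and there are a couple of packages in R that can do it for you if you like. One is the hunspell package from rOpenSci, and another option is the SnowballC package, which implements Porter algorithm stemming. You would implement that like so: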
library(dplyr)
library(tidytext)
library(SnowballC)

terms <- c("emails are nice", "emailing is fun", "computer freaks", "broken modem")

set.seed(123)
# tibble() is the current name for the deprecated data_frame()
tibble(txt = sample(terms, 100, replace = TRUE)) %>%
  unnest_tokens(word, txt) %>%     # one lowercase token per row
  mutate(word = wordStem(word))    # Porter-stem every token
#> # A tibble: 253 × 1
#> word
#> <chr>
#> 1 email
#> 2 i
#> 3 fun
#> 4 broken
#> 5 modem
#> 6 email
#> 7 i
#> 8 fun
#> 9 broken
#> 10 modem
#> # ... with 243 more rows
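Notice that this stems all of your text, and that some of the words no longer look like real words (for example, "is" becomes "i"); you may or may not care about that.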
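To see what the stemmer does to individual words, a minimal sketch along these lines may help. It assumes SnowballC is installed and uses its default Porter stemmer; the `words` vector is a made-up sample taken from the example sentences, not part of the original answer.

```r
library(SnowballC)

# Hypothetical sample vector, drawn from the example sentences above
words <- c("emails", "emailing", "is", "fun", "broken", "modem")

# wordStem() is vectorised; "porter" is its default language
stems <- wordStem(words, language = "porter")

# Pair each original word with its stem for inspection
print(data.frame(word = words, stem = stems))
```

Both "emails" and "emailing" collapse to "email", while "is" becomes "i", which is the kind of non-word the note above refers to.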
If you don't want to stem all of your text with a stemmer like SnowballC or hunspell, you can use dplyr's if_else inside of mutate() to replace just the specific words.
set.seed(123)
tibble(txt = sample(terms, 100, replace = TRUE)) %>%
  unnest_tokens(word, txt) %>%
  # replace only the listed variants; every other token is left as-is
  mutate(word = if_else(word %in% c("emailing", "emails"), "email", word))
#> # A tibble: 253 × 1
#> word
#> <chr>
#> 1 email
#> 2 is
#> 3 fun
#> 4 broken
#> 5 modem
#> 6 email
#> 7 is
#> 8 fun
#> 9 broken
#> 10 modem
#> # ... with 243 more rows
Or it might make more sense to use str_replace from the stringr package.
library(stringr)

set.seed(123)
tibble(txt = sample(terms, 100, replace = TRUE)) %>%
  unnest_tokens(word, txt) %>%
  # the regex matches "emails" or "emailing" and rewrites it to "email"
  mutate(word = str_replace(word, "email(s|ing)", "email"))
#> # A tibble: 253 × 1
#> word
#> <chr>
#> 1 email
#> 2 is
#> 3 fun
#> 4 broken
#> 5 modem
#> 6 email
#> 7 is
#> 8 fun
#> 9 broken
#> 10 modem
#> # ... with 243 more rows
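One caveat with the pattern above is that it is unanchored, so it also rewrites part of any longer token that merely contains "emails" or "emailing". A hedged variant, assuming you only want to touch those two exact tokens, anchors the pattern with ^ and $. The `tokens` vector below is a made-up example, not data from the answer.

```r
library(stringr)

# Hypothetical tokens, including one longer word that contains "emailing"
tokens <- c("email", "emails", "emailing", "emailings", "is")

# Unanchored: "emailing" inside "emailings" is rewritten, leaving "emails"
unanchored <- str_replace(tokens, "email(s|ing)", "email")

# Anchored: only whole tokens equal to "emails" or "emailing" are rewritten
anchored <- str_replace(tokens, "^email(s|ing)$", "email")

print(data.frame(token = tokens, unanchored = unanchored, anchored = anchored))
```

Whether the anchored or unanchored form is preferable depends on the vocabulary; for the four example sentences here both give the same result.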