I'm doing NLP with the tidymodels framework, making use of the textrecipes package, which provides recipe steps for text preprocessing. Here, step_tokenize takes a character vector as input and returns a tokenlist object. Now I want to spell-check the newly tokenized variable with a custom function, using functions from the hunspell package to ensure correct spelling (as in this spell-checking blog post), but I get the following error:
Error: Problem with `mutate()` column `desc`.
i `desc = correct_spelling(desc)`.
x is.character(words) is not TRUE
Apparently, tokenlists aren't easily parsed as character vectors. I noticed the existence of step_untokenize, but it only dissolves the tokenlist by pasting and collapsing the tokens, which is not what I need.
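For illustration, round-tripping through step_untokenize just gives back one pasted string per row, not individual words I could feed to hunspell_check() (a minimal sketch, assuming the product_descriptions tibble from the reprex below):

recipe(desc ~ price, data = product_descriptions) %>%
  step_tokenize(desc) %>%
  step_untokenize(desc) %>%
  prep() %>%
  bake(new_data = NULL)
# desc is back to a single string per row (e.g. "goood product"),
# i.e. essentially the untokenized text I started with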
Reprex:
library(tidyverse)
library(tidymodels)
library(textrecipes)
library(hunspell)
product_descriptions <- tibble(
  desc = c("goood product", "not sou good", "vad produkt"),
  price = c(1000, 700, 250)
)
correct_spelling <- function(input) {
  output <- case_when(
    # check and (if required) correct spelling
    !hunspell_check(input, dictionary('en_US')) ~
      hunspell_suggest(input, dictionary('en_US')) %>%
      # get first suggestion, or NA if suggestions list is empty
      map(1, .default = NA) %>%
      unlist(),
    TRUE ~ input # if word is correct
  )
  # if input incorrectly spelled but no suggestions, return input word
  ifelse(is.na(output), input, output)
}
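# Note: on a plain character vector the helper itself behaves as intended, e.g.
# correct_spelling(c("goood", "product")) should return corrected words
# (exact replacements depend on the installed en_US dictionary); the problem
# below is that desc is a tokenlist rather than a character vector.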
product_recipe <- recipe(desc ~ price, data = product_descriptions) %>%
  step_tokenize(desc) %>%
  step_mutate(desc = correct_spelling(desc))

product_recipe %>% prep()
What I want, but without recipes:
library(tidytext)

product_descriptions %>%
  unnest_tokens(word, desc) %>%
  mutate(word = correct_spelling(word))
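For reference, a sketch of the kind of in-recipe behaviour I'm after, doing the correction as part of tokenization itself. This is unverified: tokenize_and_correct is a hypothetical helper, and it assumes step_tokenize()'s custom_token argument (available in recent textrecipes versions) accepts a function that takes a character vector and returns a list of token vectors:

# hypothetical helper: tokenize each description, then spell-correct the tokens
tokenize_and_correct <- function(x) {
  map(tokenizers::tokenize_words(x), correct_spelling)
}

recipe(desc ~ price, data = product_descriptions) %>%
  step_tokenize(desc, custom_token = tokenize_and_correct) %>%
  prep() %>%
  bake(new_data = NULL)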