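I'm using the tidymodels framework for NLP with the textrecipes package, which provides recipe steps for text preprocessing. Here, step_tokenize takes a character vector as input and returns a tokenlist object. Now I want to spell-check the new tokenized variable with a custom function built on functions from the hunspell package, but I get the following error (link to the spell-checking blog post):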

Error: Problem with `mutate()` column `desc`.
i `desc = correct_spelling(desc)`.
x is.character(words) is not TRUE

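Apparently, tokenlists can't easily be parsed as character vectors. I noticed that step_untokenize exists, but it only dissolves the tokenlist by pasting and collapsing the tokens, which is not what I need.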
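
To illustrate why a function that asserts `is.character()` on its input rejects tokenized data, here is a minimal base R sketch. It rests on assumptions: a plain list of character vectors stands in for a tokenlist (the real class is more structured), and `toy_fix` is a hypothetical stand-in for a spell checker, not a hunspell function.

```r
# Assumption: a plain list of character vectors stands in for a tokenlist.
tokens_per_doc <- list(
  c("goood", "product"),
  c("not", "sou", "good")
)

is.character(tokens_per_doc)       # FALSE: the container is a list
is.character(tokens_per_doc[[1]])  # TRUE: each document's tokens are character

# Hypothetical spell checker that, like the failing call, insists on a
# character vector as input.
toy_fix <- function(words) {
  stopifnot(is.character(words))
  ifelse(words == "goood", "good", words)
}

# Calling toy_fix(tokens_per_doc) would fail the assertion.
# Applying it to each document's tokens works:
fixed <- lapply(tokens_per_doc, toy_fix)
```

The point of the sketch is only that the function has to be applied to each document's tokens, not to the container as a whole.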

Reprex

library(tidyverse)
library(tidymodels)
library(textrecipes)
library(hunspell)

product_descriptions <- tibble(
  desc = c("goood product", "not sou good", "vad produkt"),
  price = c(1000, 700, 250)
)

correct_spelling <- function(input) {
  output <- case_when(
    # check and (if required) correct spelling
    !hunspell_check(input, dictionary('en_US')) ~
      hunspell_suggest(input, dictionary('en_US')) %>%
      # get first suggestion, or NA if suggestions list is empty
      map(1, .default = NA) %>%
      unlist(),
    TRUE ~ input # if word is correct
  )
  # if input incorrectly spelled but no suggestions, return input word
  ifelse(is.na(output), input, output)
}

product_recipe <- recipe(desc ~ price, data = product_descriptions) %>% 
  step_tokenize(desc) %>% 
  step_mutate(desc = correct_spelling(desc))

product_recipe %>% prep()

What I want, but without recipes

product_descriptions %>% 
  unnest_tokens(word, desc) %>% 
  mutate(word = correct_spelling(word))

1 Answer

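There is currently no canonical way to do this with {textrecipes}. We need two things: a function that takes a vector of tokens and returns spell-checked tokens (which you provided), and a way to apply that function to the tokenlist. At the moment there is no general-purpose step that lets you do this, but you can cheat by passing the function to custom_stemmer in step_stem(), which gives you the result you want: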

library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
library(textrecipes)
library(hunspell)

product_descriptions <- tibble(
  desc = c("goood product", "not sou good", "vad produkt"),
  price = c(1000, 700, 250)
)

correct_spelling <- function(input) {
  output <- case_when(
    # check and (if required) correct spelling
    !hunspell_check(input, dictionary('en_US')) ~
      hunspell_suggest(input, dictionary('en_US')) %>%
      # get first suggestion, or NA if suggestions list is empty
      map(1, .default = NA) %>%
      unlist(),
    TRUE ~ input # if word is correct
  )
  # if input incorrectly spelled but no suggestions, return input word
  ifelse(is.na(output), input, output)
}

product_recipe <- recipe(desc ~ price, data = product_descriptions) %>% 
  step_tokenize(desc) %>% 
  step_stem(desc, custom_stemmer = correct_spelling) %>%
  step_tf(desc)

product_recipe %>% 
  prep() %>%
  bake(new_data = NULL)
#> # A tibble: 3 × 6
#>   price tf_desc_cad tf_desc_good tf_desc_not tf_desc_product tf_desc_sou
#>   <dbl>       <dbl>        <dbl>       <dbl>           <dbl>       <dbl>
#> 1  1000           0            1           0               1           0
#> 2   700           0            1           1               0           1
#> 3   250           1            0           0               1           0
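
The function passed to custom_stemmer could also be written so that each distinct token is checked only once. The sketch below is a hedged variant and not the method used above: `correct_tokens`, `check`, and `suggest` are hypothetical names, and the checker and suggester are passed in as arguments so the logic can be tried without a dictionary. It assumes, as described above, that the function receives a character vector of tokens and must return a character vector of the same length.

```r
# Hypothetical helper: correct each distinct token once, then map back.
# `check` returns a logical vector; `suggest` returns a list of character
# vectors (one per word), mirroring the shapes used by the code above.
correct_tokens <- function(tokens, check, suggest) {
  uniq <- unique(tokens)
  ok <- check(uniq)
  fixed <- uniq
  if (any(!ok)) {
    suggestions <- suggest(uniq[!ok])
    # First suggestion, or NA when there are none
    first <- vapply(
      suggestions,
      function(s) if (length(s) > 0) s[[1]] else NA_character_,
      character(1)
    )
    # Keep the original word when no suggestion is available
    fixed[!ok] <- ifelse(is.na(first), uniq[!ok], first)
  }
  fixed[match(tokens, uniq)]
}

# Stub checker and suggester, for illustration only
stub_check <- function(w) w %in% c("good", "product")
stub_suggest <- function(w) {
  lapply(w, function(x) if (x == "goood") c("good", "god") else character(0))
}

out <- correct_tokens(c("goood", "product", "xyz", "goood"),
                      stub_check, stub_suggest)

# With hunspell, the wiring might look like this (untested sketch):
# dict <- dictionary("en_US")
# step_stem(desc, custom_stemmer = function(x) correct_tokens(
#   x,
#   function(w) hunspell_check(w, dict),
#   function(w) hunspell_suggest(w, dict)
# ))
```

Creating the dictionary once and reusing it avoids calling dictionary('en_US') on every invocation. Whether that matters in practice depends on the size of the corpus.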
Answered 2021-11-18T17:58:38.247