r - Filter out non-English words from a corpus using "textcat"

Question

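Similar to this SO member, I've been looking for a simple package in R that filters out non-English words. For example, I might have a list of words that looks like this: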

Flexivel
eficaz
gut-wrenching
satisfatorio
apropiado
Benutzerfreundlich
interessante
genial
cool
marketing
clients
internet

My end goal is simply to filter the non-English words out of the corpus, so that my list becomes:

gut-wrenching
cool
marketing
clients
internet

I have read the data in as a data.frame, although it is subsequently transformed into a corpus and then into a TermDocumentMatrix in order to create a wordcloud using wordcloud and tm.

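I am currently using the package textcat to filter by language. The documentation is a bit over my head, but it seems to indicate that you can run the textcat command on lists. For example, if the data above were in a data.frame called df with a single column called "words", I would run: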

library(textcat)
textcat(c(df$words))

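However, this has the effect of reading the entire list of words as a single document, rather than looking at each row and determining its language. Please help!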

score 0 · Accepted Answer

For a dictionary search, you could use aspell:

txt <- c("Flexivel", "eficaz", "gut-wrenching", "satisfatorio", "apropiado",
  "Benutzerfreundlich", "interessante", "genial", "cool", "marketing",
  "clients", "internet")

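# aspell() checks files, so write the words to a temporary file first
fn <- tempfile()
writeLines(txt, fn)
# needs an aspell, hunspell, or ispell executable on the system path
result <- aspell(fn)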

result$Original gives the words that were not found in the dictionary. From these you can select the words that did match:

> result$Original
[1] "Flexivel"           "eficaz"             "satisfatorio"      
[4] "apropiado"          "interessante"       "Benutzerfreundlich"
> english <- txt[!(txt %in% result$Original)]
> english
[1] "gut-wrenching" "genial"        "cool"          "marketing"    
[5] "clients"       "internet"

However, as Carl Witthoft points out, you cannot be sure that these are actually English words. For example, "cool", "marketing", and "internet" are also valid Dutch words.
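
To apply the same steps directly to a data.frame column, they could be wrapped in a small function. This is only a sketch: keep_dictionary_words is a hypothetical name, not part of any package, the example df is made up for illustration, and the result depends on which spell-checker program and dictionary are installed on your system.

```r
# Hypothetical helper (an assumption for illustration, not a package function):
# keep only the words that aspell() does not flag as misspelled.
keep_dictionary_words <- function(words) {
  words <- as.character(words)     # in case the column is a factor
  fn <- tempfile()
  on.exit(unlink(fn))              # remove the temporary file afterwards
  writeLines(words, fn)
  flagged <- aspell(fn)$Original   # words not found in the dictionary
  words[!(words %in% flagged)]
}

# Made-up example data, using the column name "words" from the question
df <- data.frame(words = c("Flexivel", "cool", "marketing"),
                 stringsAsFactors = FALSE)
english <- keep_dictionary_words(df$words)
```

The caveat above still applies: a word that passes the dictionary check is not necessarily English only.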
