r - Quanteda: How to remove my own list of words

Question

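Since quanteda has no ready-made implementation of Polish stopwords, I would like to use my own list. I have it in a text file as a space-separated list. If need be, I can also prepare a newline-separated list.

How can I remove a long custom list of stopwords from my corpus? And how can I do that after stemming?

I have tried creating various formats and converting them to string vectors, such as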

stopwordsPL <- as.character(readtext("polish.stopwords.txt",encoding = "UTF-8"))
stopwordsPL <- read.txt("polish.stopwords.txt",encoding = "UTF-8",stringsAsFactors = F))
stopwordsPL <- dictionary(stopwordsPL)

I have also tried using such word vectors in the syntax

myStemMat <-
  dfm(
    mycorpus,
    remove = as.vector(stopwordsPL),
    stem = FALSE,
    remove_punct = TRUE,
    ngrams=c(1,3)
  )

dfm_trim(myStemMat, sparsity = stopwordsPL)

or

myStemMat <- dfm_remove(myStemMat,features = as.data.frame(stopwordsPL))

Nothing works. My stopwords still appear in the corpus and in the analysis. What is the correct way/syntax to apply custom stopwords?

score 10 · Accepted Answer

Assuming your polish.stopwords.txt has one stopword per line, you should be able to remove them from your corpus easily this way:

stopwordsPL <- readLines("polish.stopwords.txt", encoding = "UTF-8")

dfm(mycorpus,
    remove = stopwordsPL,
    stem = FALSE,
    remove_punct = TRUE,
    ngrams=c(1,3))

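The solution using readtext does not work because it reads the whole file in as a single document. To get the individual words, you would need to tokenize it and coerce the tokens to character. readLines() is probably easier.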
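If the list is space-separated rather than one word per line, readLines() alone also returns a single string. A minimal base R sketch of one way around that, assuming a small illustrative file (the temp file and its five words are made up for the example, not taken from the question):

```r
# Hypothetical file contents, written to a temp file so the sketch is self-contained
path <- tempfile(fileext = ".txt")
writeLines("i w na z do", path)

# readLines() returns one element per line, so a space-separated file
# comes back as a single long string rather than individual words
one_line <- readLines(path, encoding = "UTF-8")
length(one_line)

# scan() splits on any whitespace (spaces, tabs, newlines),
# giving one element per stopword
stopwordsPL <- scan(path, what = character(), quote = "",
                    encoding = "UTF-8", quiet = TRUE)
is.character(stopwordsPL)
```

Because scan() treats spaces and newlines alike, the same call should work for either file layout mentioned in the question.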

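There is no need to create a dictionary from stopwordsPL, since remove should take a character vector. Also, I'm afraid there is no Polish stemmer implemented yet.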

Currently (v0.9.9-65), feature removal in dfm() does not get rid of stopwords that form part of bigrams. To work around this, try:

# form the tokens, removing punctuation
mytoks <- tokens(mycorpus, remove_punct = TRUE)
# remove the Polish stopwords, leave pads
mytoks <- tokens_remove(mytoks, stopwordsPL, padding = TRUE)
## can't do this next one since no Polish stemmer in 
## SnowballC::getStemLanguages()
# mytoks <- tokens_wordstem(mytoks, language = "polish")
# form the ngrams
mytoks <- tokens_ngrams(mytoks, n = c(1, 3))
# construct the dfm
dfm(mytoks)
