
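Hello, text miners,

I am fairly new to this field and am trying to use textstat_simil from quanteda (an R package) to assess the similarity between phrases. These are very early steps, so I'm sure I'm missing something obvious, but I still can't get the feature selection to work: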

    library(quanteda)

    # 1. Create the corpus
    myCorpus <- corpus(c("Anna, Maria, Luisa"))
    checkWords <- c('Luisianna', 'anneta')
    summary(myCorpus)

    myDfm <- dfm(myCorpus)
    myDfm  # check that the features are there

    # Remove stopwords and punctuation, keep the stems
    myDfmNoStop <- dfm(myCorpus, remove = stopwords("english"), stem = TRUE, remove_punct = TRUE)

    sim <- textstat_simil(myDfmNoStop, checkWords, method = "cosine", margin = "features")

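This returns the error:

"Error in textstat_simil.dfm(myDfm, checkWords, method = "cosine", margin = "features") : The features specified by 'selection' do not exist."

So it is not clear to me how to specify the correct features/words from my corpus.

Needless to say, any feedback is very welcome :)

Cheers,

George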


1 Answer


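The words in checkWords are not features of your dfm, which is why the selection fails. Try this instead, which adds the check words to the corpus as a document of their own and compares documents rather than features: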

    myCorpus <- corpus(c(check = "Luisianna, anneta",
                         target1 = "Anna, Maria, Luisa",
                         target2 = "Anna, anneta"))

    myDfmNoStop <- dfm(myCorpus, remove = stopwords("english"), stem = TRUE, remove_punct = TRUE)

    sim <- textstat_simil(myDfmNoStop, myDfmNoStop["check", ], method = "cosine", margin = "documents")
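
To see what the document-level comparison is doing, here is a minimal base R sketch that does not use quanteda at all. It builds a toy count matrix by hand for the same three documents and computes cosine similarity against the "check" document. The names docs, features, counts and cosine are my own, and the tokens are simply lowercased by hand with no stemming, so treat it as an illustration of the idea and not as what textstat_simil does internally.

```r
# Toy tokens for each document, lowercased by hand (assumption: no stemming,
# no stopword removal; a real dfm may hold different features).
docs <- list(
  check   = c("luisianna", "anneta"),
  target1 = c("anna", "maria", "luisa"),
  target2 = c("anna", "anneta")
)

# All distinct words across the documents play the role of dfm features.
features <- sort(unique(unlist(docs)))

# Document-by-feature count matrix (rows = documents, columns = features).
counts <- t(vapply(
  docs,
  function(toks) tabulate(match(toks, features), nbins = length(features)),
  integer(length(features))
))
colnames(counts) <- features

# Why the original call failed: neither check word occurs in the
# one-document corpus "Anna, Maria, Luisa", so there is no such feature.
c("luisianna", "anneta") %in% docs$target1

# Cosine similarity between two count vectors.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Compare every document with the "check" document.
sims <- apply(counts, 1, cosine, b = counts["check", ])
round(sims, 3)
```

With these toy counts, "check" scores 1 against itself, 0 against target1 (no shared words) and 0.5 against target2 (one shared word out of two in each). Note that cosine similarity on word counts only rewards exact matches, so "Luisianna" and "Luisa" count as unrelated words here.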
answered 2017-11-15T10:42:12.417