我在 Linux 中执行了 LDA,并且在主题 2 中没有得到像“ø”这样的字符。但是,当在 Windows 中运行时,它们会显示出来。有谁知道如何处理这个?我使用了包quanteda
和topicmodels
.
> terms(LDAModel1,5)
Topic 1 Topic 2
[1,] "car" "ø"
[2,] "build" "ù"
[3,] "work" "network"
[4,] "drive" "ces"
[5,] "musk" "new"
编辑:
数据:https ://www.dropbox.com/s/tdr9yok7tp0pylz/technology201501.csv
代码是这样的:
library(quanteda)
library(topicmodels)
myCorpus <- corpus(textfile("technology201501.csv", textField = "title"))
myDfm <- dfm(myCorpus,ignoredFeatures=stopwords("english"), stem = TRUE, removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE)
myDfm <-removeFeatures(myDfm, c("reddit", "redditors","redditor","nsfw", "hey", "vs", "versus", "ur", "they'r", "u'll", "u.","u","r","can","anyone","will","amp","http","just"))
sparsityThreshold <- round(ndoc(myDfm) * (1 - 0.9999))
myDfm2 <- trim(myDfm, minDoc = sparsityThreshold)
LDAModel1 <- LDA(quantedaformat2dtm(myDfm2), 25, 'Gibbs', list(iter=4000,seed = 123))