tm - Reading documents with r-tm for use with r-mallet

Question

I have this code to fit a topic model with the R wrapper for MALLET:

docs <- mallet.import(DF$document, DF$text, stop_words)

mallet_model <- MalletLDA(num.topics = 4)
mallet_model$loadDocuments(docs)
mallet_model$train(100)

I have already used the tm package to read my documents, which are txt files in a directory:

myCorpus <- Corpus(DirSource("data")) # a directory of txt files

A corpus cannot be used as input to mallet.import, so how do I get from the tm corpus myCorpus above to the DF used in that call?

score 2 · Accepted Answer

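RMallet is meant to be a standalone package, so its integration with tm is not great. The input RMallet requires is a data frame with one row per document and a character field containing the text, which it expects not to have been tokenized yet.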
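The conversion described above can be sketched roughly as follows. This is only an illustration under some assumptions: corpus_to_df is a hypothetical helper name (it is not part of tm or mallet), the column names document and text are chosen to match the DF in the question, and a throwaway temporary directory stands in for the "data" directory.

```r
library(tm)

# Hypothetical helper (not part of tm or mallet): flatten a tm corpus
# into a data frame with one row per document and untokenized text.
corpus_to_df <- function(corpus) {
  idx <- seq_along(corpus)
  # Document ids as stored in each document's metadata (file names for DirSource)
  ids <- vapply(idx, function(i) as.character(meta(corpus[[i]], "id")),
                character(1))
  # Join the lines of each document into a single string
  texts <- vapply(idx, function(i) paste(as.character(corpus[[i]]),
                                         collapse = "\n"),
                  character(1))
  data.frame(document = ids, text = texts, stringsAsFactors = FALSE)
}

# Throwaway directory standing in for "data" (assumption for this demo only)
demo_dir <- file.path(tempdir(), "tm_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("first line", "second line"), file.path(demo_dir, "a.txt"))
writeLines("another document", file.path(demo_dir, "b.txt"))

myCorpus <- Corpus(DirSource(demo_dir))
DF <- corpus_to_df(myCorpus)
```

With DF built this way, DF$document and DF$text are plain character vectors, which is the shape the mallet.import call in the question expects.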

score 1 · Accepted Answer

You can use tidy data principles to process your text and get it ready for input to mallet, with one row per document, as described here.
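A minimal sketch of that tidy workflow, assuming the dplyr and tidytext packages are installed; the two sample documents and the names raw and collapsed are made up for illustration. The text is tokenized, stop words are removed, and the remaining words are pasted back together so that each document is a single row again.

```r
library(dplyr)
library(tidytext)

# Made-up sample data: one row per document, untokenized text
raw <- tibble(
  document = c("a", "b"),
  text = c("The cat sat on the mat", "Dogs chase the cat")
)

collapsed <- raw %>%
  unnest_tokens(word, text) %>%                # one row per word, lowercased
  anti_join(stop_words, by = "word") %>%       # drop tidytext's stop words
  group_by(document) %>%
  summarize(text = paste(word, collapse = " "), .groups = "drop")

# collapsed now has one row per document. Because stop words are already
# removed, an empty stoplist file could be handed to mallet (requires Java):
# empty_file <- tempfile()
# file.create(empty_file)
# docs <- mallet::mallet.import(collapsed$document, collapsed$text, empty_file)
```

Removing stop words in the tidy step rather than inside mallet keeps all preprocessing in one place, at the cost of passing mallet an empty stoplist file.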

In addition, tidytext has tidiers for the mallet package, which you can use to analyze the output of mallet topic modeling:

# word-topic pairs
tidy(mallet_model)

# document-topic pairs
tidy(mallet_model, matrix = "gamma")

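# column needs to be named "term" for "augment"
# (word_counts is not defined in this snippet; it is assumed to be a data
# frame of per-document word counts with a "word" column; rename() is from dplyr)
term_counts <- rename(word_counts, term = word)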
augment(mallet_model, term_counts)
