r - Compiling and analyzing a corpus with R and koRpus

Question

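I am a literature student lost in data science. I am trying to analyze a corpus of 70 .txt files, all located in one directory.

My end goal is a table containing the file name (or something similar), the sentence and word counts, a Flesch-Kincaid readability score and an MTLD lexical diversity score.

I found the koRpus and tm packages (and tm.plugin.koRpus) and have tried to make sense of their documentation, but I haven't gotten far. With the help of the RKWard IDE and its koRpus plugin I managed to get all of these measures for one file at a time, and I could copy the data into a table by hand, but that is very tedious and still a lot of work.

What I have tried so far is this command to create a corpus from my files: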

simpleCorpus(dir = "/home/user/files/", lang = "en", tagger = "tokenize",
             encoding = "UTF-8", pattern = NULL, recursive = FALSE,
             ignore.case = FALSE, mode = "text", source = "Wikipedia",
             format = "file", mc.cores = getOption("mc.cores", 1L))

But I always get this error:

Error in data.table(token = tokens, tag = unk.kRp):column or argument 1 is NULL).

I would be very grateful if someone could help an absolute R newbie!

score 0 · Accepted Answer

Here is a very thorough walkthrough... If I were you, I would work through it step by step.

http://tidytextmining.com/tidytext.html

score 0 · Accepted Answer

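I found the solution with the help of unDocUMeantIt, the author of the package (thank you!). An empty file in the directory caused the error; after removing it I managed to get everything running.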
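For anyone hitting the same error, here is a minimal base R sketch for spotting such files before building the corpus. The helper name find_empty_files is hypothetical (it is not part of koRpus or tm), and it assumes the texts are plain .txt files that readLines() can read.

```r
# Hypothetical helper, base R only: return the paths of files in `dir`
# that contain no text at all (zero bytes, or whitespace only).
find_empty_files <- function(dir, pattern = "\\.txt$") {
  files <- list.files(dir, pattern = pattern, full.names = TRUE)
  is_blank <- vapply(files, function(f) {
    lines <- readLines(f, warn = FALSE)
    # TRUE when no line has any non-whitespace character
    !any(nzchar(trimws(lines)))
  }, logical(1))
  files[is_blank]
}

# Usage with the directory from the question:
# find_empty_files("/home/user/files/")
```

Any path it returns can then be inspected and moved out of the directory before calling simpleCorpus() again.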

score 0 · Accepted Answer

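I would suggest having a look at our quanteda vignette, Digital Humanities Use Case: Replication of analyses from Text Analysis with R for Students of Literature, which replicates Matt Jockers' book of the same name.

For what you are looking for above, the following will work: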

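library(readtext)
library(quanteda)
# in quanteda >= 3 the textstat_*() functions live in quanteda.textstats;
# with older versions this line can be dropped
library(quanteda.textstats)

# reads in all of your texts and puts them into a corpus
mycorpus <- corpus(readtext("/home/user/files/*"))

# sentence and word counts
(output_df <- summary(mycorpus))

# to compute Flesch-Kincaid readability on the texts
textstat_readability(mycorpus, "Flesch.Kincaid")

# to compute lexical diversity on the texts
# (tokenize first: recent quanteda versions no longer build a dfm straight from a corpus)
textstat_lexdiv(dfm(tokens(mycorpus)))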

The textstat_lexdiv() function does not currently offer MTLD, but we are working on it, and it does have half a dozen other measures.
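To get everything into the single table the question asks for, the separate results can be merged on the document name. This is only a sketch: build_corpus_table is a hypothetical helper name, it assumes quanteda >= 3 with the quanteda.textstats package installed, and TTR is used purely as a stand-in diversity measure because MTLD is not available here.

```r
library(readtext)
library(quanteda)
library(quanteda.textstats)  # assumption: quanteda >= 3

# Hypothetical helper: one row per .txt file in `dir`, with sentence and
# word counts, Flesch-Kincaid score and TTR (a stand-in for MTLD).
build_corpus_table <- function(dir) {
  corp  <- corpus(readtext(file.path(dir, "*.txt")))
  words <- tokens(corp, remove_punct = TRUE)

  counts <- data.frame(
    document  = docnames(corp),
    # tokenizing by sentence and counting the tokens gives sentences per text
    sentences = unname(ntoken(tokens(corp, what = "sentence"))),
    words     = unname(ntoken(words))
  )
  fk  <- textstat_readability(corp, measure = "Flesch.Kincaid")
  ttr <- textstat_lexdiv(words, measure = "TTR")

  # all three tables share a `document` column holding the file names
  merge(merge(counts, fk, by = "document"), ttr, by = "document")
}

# Usage with the directory from the question:
# result <- build_corpus_table("/home/user/files/")
# write.csv(result, "corpus_table.csv", row.names = FALSE)
```

The word count here excludes punctuation, so it may differ slightly from the Tokens column reported by summary().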
