r - How do I load multiple JSON files into a quanteda corpus using readtext?

Question

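I am trying to load a large number of JSON files from a news website into a quanteda corpus using readtext. To simplify the process, the JSON files are all in the working directory, but I have also tried keeping them in a directory of their own.

When c() is used to create a variable that explicitly names a small subset of the files, readtext works as expected and corpus() builds the corpus properly.
When list.files() is used to create a variable listing all 1,500+ JSON files, readtext does not work as expected: it returns an error and no corpus is created.

I have tried inspecting both ways of defining the set of texts (i.e. c() and list.files()), as well as paste0().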

# Load libraries
library(readtext)
library(quanteda)

# Define a set of texts explicitly
a <- c("border_2020_05_10__1589150513.json","border_2020_05_10__1589143358.json","border_2020_05_07__1589170960.json")

# This produces a corpus
extracted_texts <- readtext(a, text_field = "maintext")
my_corpus <- corpus(extracted_texts)

# Define a set of all texts in working directory
b <- list.files(pattern = "*.json", full.names = F)

# This, which I hope to use, produces an error
extracted_texts <- readtext(b, text_field = "maintext")
my_corpus <- corpus(extracted_texts)

The error produced by extracted_texts <- readtext(b, text_field = "maintext") is as follows:

File doesn't contain a single valid JSON object.
Error: This JSON file format is not supported.

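This is confusing, because the same files do not produce an error when they are passed via a. I validated several of the JSON files, and in every case they came back VALID (RFC 8259), the IETF standard for JSON.

Checking the differences between a and b:

typeof() returns "character" for both a and b.
is.vector() and is.atomic() return TRUE for both.
is.list() returns FALSE for both.
They look similar in RStudio and when printed in the console.

I am really puzzled as to why a works and b does not.

Finally, in an attempt to mimic exactly the procedure used in the readtext documentation, I also tried the following: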

# XXXX = my username
data_dir <- file.path("C:/Users/XXXX/Documents/R/")

d <- readtext(paste0(data_dir, "/corpus_linguistics/*.json"), text_field = "maintext")

This also returned the error:

File doesn't contain a single valid JSON object.
Error: This JSON file format is not supported.

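At this point, I am stumped. Thanks in advance for any insights on how to move forward.

Solution and Summary

Unclean data: some of the input JSON files had an empty maintext field. These are not useful for analysis and should be removed. All files also contained a JSON field named "title_rss" whose value was null. This can be eliminated with a directory-level find and replace in Notepad++, or possibly with R or Python, although I still lack the skills for that. In addition, the files were not UTF-8 encoded, which was resolved with Codepage Converter.
Calling list.files() versus passing a directory string: the directory-string method is used in the readtext How to Use documentation and in some third-party tutorials. It works for *.txt files, but for some reason it did not seem to work with these particular JSON files. Once the JSON files were properly cleaned and encoded, the method below worked correctly. If data_dir is wrapped in a list.files() call, it produces the following error: Error in list_files(file, ignore_missing, TRUE, verbosity) : File '' does not exist. I do not know why this happens, but leaving list.files() out works for these JSON files.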

# Load libraries
library(readtext)
library(quanteda)

# Read every JSON file in the directory via a glob pattern
data_dir <- "C:/Users/Nathan/Documents/R/corpus_linguistics/"
extracted_texts <- readtext(paste0(data_dir, "texts_unmodified/*.json"), text_field = "maintext", verbosity = 3)
my_corpus <- corpus(extracted_texts)
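
The cleaning step described above (setting aside files whose text field is empty or null) could also be done in R rather than by hand. The following is only a rough sketch under stated assumptions: it uses jsonlite, assumes each file holds a single top-level JSON object with a maintext field, and the directory, file names, and the has_text() helper are made up for illustration. It does not handle the encoding conversion.

```r
library(jsonlite)

# Hypothetical demo directory with two small files standing in for the real data
demo_dir <- file.path(tempdir(), "json_cleaning_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines('{"maintext": "Some article text", "title_rss": null}',
           file.path(demo_dir, "ok.json"))
writeLines('{"maintext": null, "title_rss": null}',
           file.path(demo_dir, "empty.json"))

# Hypothetical helper: TRUE if the file parses and its text field is a
# single non-empty string, FALSE otherwise (including on parse errors)
has_text <- function(path, field = "maintext") {
  parsed <- tryCatch(fromJSON(path), error = function(e) NULL)
  value <- if (is.list(parsed)) parsed[[field]] else NULL
  is.character(value) && length(value) == 1 &&
    !is.na(value) && nzchar(trimws(value))
}

files <- list.files(demo_dir, pattern = "\\.json$", full.names = TRUE)
keep <- vapply(files, has_text, logical(1))

# Files that are safe to hand to readtext; the rest can be inspected by hand
clean_files <- files[keep]
print(basename(clean_files))
```

The resulting character vector can then be passed to readtext() in the same way as the explicit vector a in the question.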

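Testing with unmodified files, one of which is known to have an empty field

Input: 5 files, 4 of them without an empty or null text_field and 1 with a null text_field. In addition, all files have Western European (Windows) 1252 encoding.

Errors: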

Reading texts from C:/Users/Nathan/Documents/R/corpus_linguistics/texts_unmodified/*.json
, using glob pattern
 ... reading (json) file: C:/Users/Nathan/Documents/R/corpus_linguistics/texts_unmodified/border_2014_02_17__1589147645.json
File doesn't contain a single valid JSON object.
 contain a single valid JSON object.
 ... reading (json) file: C:/Users/Nathan/Documents/R/corpus_linguistics/texts_unmodified/border_2014_03_13__1589150325.json
File doesn't contain a single valid JSON object.
Column 14 ['maintext'] of item 1 is length 0. This (and 0 others like it) has been filled with NA (NULL for list columns) to make each item uniform. ... read 5 documents.

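Result: a properly formed corpus consisting of 5 documents. One document has no tokens or types. Despite the errors, the corpus appears to build correctly. Because of the encoding problem, some special characters may not display correctly; I was not able to check this.

Testing with cleaned files known to have no empty fields

Input files: 4 files with no empty or null JSON fields. In all cases, text_field contains text and the title_rss field has been removed. Each file was converted from Western European (Windows) 1252 to Unicode UTF-8-65001.

Errors: None!

Result: a properly formed corpus.

Many thanks to both developers for their detailed feedback and helpful leads. The assistance is greatly appreciated.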

score 3 · Accepted Answer

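There are a few possibilities here, but the most likely are:

One of your files has a format that causes an error when it is read by readtext(). Even though it may be fine from a strict JSON standpoint, if, for instance, one of your text fields is empty, this will cause the error. (See below for a demonstration and a solution.)
While readtext() can take a "glob" pattern match, list.files() takes a regular expression. It is possible (but unlikely) that you are picking up a file you do not want with list.files(pattern = "*.json".... But this should not be necessary for readtext(); see below.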
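
To illustrate the glob versus regular expression distinction in the second point, here is a small base R sketch. The directory and file names are made up for the demonstration; glob2rx() is the base R (utils) function that translates a glob into the equivalent regular expression.

```r
# Hypothetical demo directory; the file names are invented for illustration
demo_dir <- file.path(tempdir(), "glob_vs_regex_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("a.json", "b.json", "notes_json.txt")))

# Unanchored regular expression: "." matches any character, so this also
# picks up "notes_json.txt"
loose <- list.files(demo_dir, pattern = ".json")

# Anchored regular expression: only names that end in ".json"
strict <- list.files(demo_dir, pattern = "\\.json$")

# The same selection, starting from a glob and converting it to a regex
from_glob <- list.files(demo_dir, pattern = utils::glob2rx("*.json"))

print(loose)
print(strict)
print(from_glob)
```

If list.files() is used at all, an anchored pattern such as "\\.json$" is the safer choice, since it cannot match a file that merely contains "json" somewhere in its name.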

To demonstrate, let's write each document in data_corpus_inaugural out as a separate JSON file, and then read them back in using readtext().

library("quanteda", warn.conflicts = FALSE)
## Package version: 2.0.1
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.

tmpdir <- tempdir()
corpdf <- convert(data_corpus_inaugural, to = "data.frame")
for (d in corpdf$doc_id) {
  cat(jsonlite::toJSON(dplyr::filter(corpdf, doc_id == d)),
    file = paste0(tmpdir, "/", d, ".json")
  )
}

head(list.files(tmpdir))
## [1] "1789-Washington.json" "1793-Washington.json" "1797-Adams.json"     
## [4] "1801-Jefferson.json"  "1805-Jefferson.json"  "1809-Madison.json"

To read them in, you can use a "glob" pattern match here and simply read the JSON files.

rt <- readtext::readtext(paste0(tmpdir, "/*.json"),
  text_field = "text", docid_field = "doc_id"
)
summary(corpus(rt), n = 5)
## Corpus consisting of 58 documents, showing 5 documents:
## 
##                  Text Types Tokens Sentences Year  President FirstName
##  1789-Washington.json   625   1537        23 1789 Washington    George
##  1793-Washington.json    96    147         4 1793 Washington    George
##       1797-Adams.json   826   2577        37 1797      Adams      John
##   1801-Jefferson.json   717   1923        41 1801  Jefferson    Thomas
##   1805-Jefferson.json   804   2380        45 1805  Jefferson    Thomas
##                  Party
##                   none
##                   none
##             Federalist
##  Democratic-Republican
##  Democratic-Republican

So that all worked fine.

But if we add a file in which one of the text fields is empty, this produces the error in question:

cat('[ { "doc_id" : "d1", "text" : "this is a file" },
       { "doc_id" : "d2", "text" :  } ]',
  file = paste0(tmpdir, "/badfile.json")
)
rt <- readtext::readtext(paste0(tmpdir, "/*.json"),
  text_field = "text", docid_field = "doc_id"
)
## File doesn't contain a single valid JSON object.
## Error: This JSON file format is not supported.

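True, that is not a valid JSON file, since it contains a tag with no value. But I suspect you have something like that in one of your files.

Here is how you can identify the problem: loop through your b (the one from the question, not the one I specify below).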

b <- tail(list.files(tmpdir, pattern = ".*\\.json", full.names = TRUE))
for (f in b) {
  cat("Reading:", f, "\n")
  rt <- readtext::readtext(f, text_field = "text", docid_field = "doc_id")
}
## Reading: /var/folders/92/64fddl_57nddq_wwqpjnglwn48rjsn/T//RtmpuhmGRK/2001-Bush.json 
## Reading: /var/folders/92/64fddl_57nddq_wwqpjnglwn48rjsn/T//RtmpuhmGRK/2005-Bush.json 
## Reading: /var/folders/92/64fddl_57nddq_wwqpjnglwn48rjsn/T//RtmpuhmGRK/2009-Obama.json 
## Reading: /var/folders/92/64fddl_57nddq_wwqpjnglwn48rjsn/T//RtmpuhmGRK/2013-Obama.json 
## Reading: /var/folders/92/64fddl_57nddq_wwqpjnglwn48rjsn/T//RtmpuhmGRK/2017-Trump.json 
## Reading: /var/folders/92/64fddl_57nddq_wwqpjnglwn48rjsn/T//RtmpuhmGRK/badfile.json 
## File doesn't contain a single valid JSON object.
## Error: This JSON file format is not supported.
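
The loop above stops at the first file that fails. With 1,500+ files it may be more convenient to collect every failing file in a single pass. The following is a minimal sketch of that idea, not the method used above: it assumes jsonlite::fromJSON() as a stand-in parser so that it runs without readtext, and the directory and file names are invented. A call to readtext::readtext() could be placed inside the tryCatch() instead.

```r
library(jsonlite)

# Hypothetical demo directory with one good and one malformed file
demo_dir <- file.path(tempdir(), "find_bad_json_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines('[{"doc_id": "d1", "text": "this is a file"}]',
           file.path(demo_dir, "good.json"))
writeLines('[{"doc_id": "d2", "text": }]',
           file.path(demo_dir, "badfile.json"))

files <- list.files(demo_dir, pattern = "\\.json$", full.names = TRUE)

# Try each file in turn; record FALSE on error instead of stopping the loop
parsed_ok <- vapply(files, function(f) {
  tryCatch({
    fromJSON(f)
    TRUE
  }, error = function(e) FALSE)
}, logical(1))

# Names of the files that could not be parsed
bad_files <- basename(files[!parsed_ok])
print(bad_files)
```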
