r - How to apply officer::read_docx to an entire folder

Question

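I am trying to scan a large number of documents in order to reorganize their text into a standard format. This involves either extracting tables with docxtractr and extracting the body text separately with textreadr, or using officer::docx_summary to label body and table text so that it is easier to manipulate. For this question, I am using officer::read_docx and officer::docx_summary. My test document is a .docx file that contains nonsense text before and after a table holding text and numbers.

My code is: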

dir <- "C:/path/to/documents"
filenames <- list.files(path = dir, pattern = "*.docx", full.names = TRUE)
docxtest <- officer::docx_summary(lapply(filenames, officer::read_docx))

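Ideally, this would produce a list of data frames containing the docx_summary information. I tried using lapply over the list of file names, but I get an error when I try to view the output list: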

Error in names[[i]]: subscript out of bounds.

score 1 · Accepted Answer

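officer::docx_summary operates on a single object returned by officer::read_docx; it does not accept a list of them. In your code, lapply returns a list of documents and that whole list is passed to docx_summary. Call both functions on each file inside lapply instead: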

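# Match file names ending in ".docx" (the pattern is a regular expression, not a glob)
filenames <- list.files(path = dir, pattern = "\\.docx$", full.names = TRUE)
# Read and summarise one document at a time; the result is a list of data frames
docxtest <- lapply(filenames, function(x) officer::docx_summary(officer::read_docx(x)))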
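
As a possible follow-up to the per-file approach, here is a minimal sketch that wraps it in a function and names each list element after its source file, so a summary can be traced back to its document. The helper name `summarise_docx_folder` is hypothetical and not part of officer. Skipping `~$` files rests on the assumption that some documents may be open in Word, which leaves temporary lock files in the folder.

```r
library(officer)

# Hypothetical helper (not part of officer): summarise every .docx in a folder.
summarise_docx_folder <- function(folder) {
  # "\\.docx$" is a regular expression matching names that end in ".docx"
  filenames <- list.files(path = folder, pattern = "\\.docx$", full.names = TRUE)
  # Assumption: skip the "~$..." lock files Word creates for open documents
  filenames <- filenames[!startsWith(basename(filenames), "~$")]
  # One docx_summary() data frame per file
  summaries <- lapply(filenames, function(x) docx_summary(read_docx(x)))
  # Name each element after its file so results can be looked up by name
  names(summaries) <- basename(filenames)
  summaries
}

# Usage (placeholder path, as in the question):
# docxtest <- summarise_docx_folder("C:/path/to/documents")
# head(docxtest[[1]])
```

The result is kept as a list rather than row-bound into one data frame, because the columns returned by docx_summary may differ between documents with and without tables.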
