r - How do I create a corpus of *.docx files with tm?

Question

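I have a collection of MS Word documents of mixed file types. Some files are *.doc and some are *.docx. I'm learning to use tm, and I have (more or less*) successfully created a corpus from the *.doc files using the following: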

ex_eng <- Corpus(DirSource('~/R/expertise/corpus/english'), 
                 readerControl=list(reader=readDOC, 
                                    language='en_CA',
                                    load=TRUE));

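This command does not handle *.docx files. I assume I need a different reader. From this post, I understand that I could write my own (given a good understanding of the .docx format, which I do not currently have).

The readDOC reader uses antiword to parse *.doc files. Is there a similar application that can parse *.docx files?

Or better still, is there already a standard way of creating a corpus of *.docx files using tm?

* More or less, because although the files are read in and are readable, I get the following warning for every document: In readLines(y, encoding = x$Encoding) : incomplete final line found on 'path/to/a/file.doc'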

score 5 · Accepted Answer

.docx files are zipped XML files. If you execute this:

> uzfil <- unzip(file.choose())

and then choose a .docx file in your directory, you get:

> str(uzfil)
 chr [1:13] "./[Content_Types].xml" "./_rels/.rels" "./word/_rels/document.xml.rels" ...
> uzfil
 [1] "./[Content_Types].xml"          "./_rels/.rels"                  "./word/_rels/document.xml.rels"
 [4] "./word/document.xml"            "./word/theme/theme1.xml"        "./docProps/thumbnail.jpeg"     
 [7] "./word/settings.xml"            "./word/webSettings.xml"         "./word/styles.xml"             
[10] "./docProps/core.xml"            "./word/numbering.xml"           "./word/fontTable.xml"          
[13] "./docProps/app.xml"

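This also silently unzips all of those files into your working directory. The "./word/document.xml" file contains the words you are looking for, so you can probably read them with one of the XML tools in the XML package. I'm guessing you would do something along these lines: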

 library(XML)
 # unzip() has already extracted the files, so uzfil[4] is the path "./word/document.xml"
 xtext <- xmlTreeParse(uzfil[4], useInternalNodes = TRUE)

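In practice, you will probably want to extract to a temporary directory and prepend that path to the file name "./word/document.xml".

You may want to use the further steps provided by @GaborGrothendieck in this answer: How to extract xml data from a CrossRef using R?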
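To make the temporary-directory idea concrete, here is a minimal sketch. The helper names `document_xml_to_text` and `read_docx_text` are made up for illustration, and it assumes the text you want lives in `w:t` elements inside `w:p` paragraphs of word/document.xml (headers, footnotes and the like sit in other parts of the archive and are ignored here).

```r
library(XML)

# WordprocessingML namespace used by word/document.xml
w_ns <- c(w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main")

# Hypothetical helper: return one string per paragraph (<w:p>) found in
# an already-extracted document.xml file.
document_xml_to_text <- function(xml_path) {
  doc <- xmlParse(xml_path)
  paras <- getNodeSet(doc, "//w:p", namespaces = w_ns)
  vapply(paras, function(p) {
    # Collect the text runs (<w:t>) below this paragraph and glue them together
    runs <- xpathSApply(p, ".//w:t", xmlValue, namespaces = w_ns)
    paste(unlist(runs), collapse = "")
  }, character(1))
}

# Hypothetical helper: extract only word/document.xml into a throwaway
# directory (so nothing lands in the working directory), then read it.
read_docx_text <- function(docx_path) {
  tmp <- tempfile("docx_")
  dir.create(tmp)
  on.exit(unlink(tmp, recursive = TRUE))
  unzip(docx_path, files = "word/document.xml", exdir = tmp)
  document_xml_to_text(file.path(tmp, "word", "document.xml"))
}
```

Something like `read_docx_text("path/to/a/file.docx")` should then give a character vector with one element per paragraph, which you can collapse with `paste(..., collapse = "\n")` before handing it to tm.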

score 0 · Accepted Answer

I ended up using docx2txt to convert the .docx files to text. Then I created a corpus from them like this:

ex_eng <- Corpus(DirSource('~/R/expertise/corpus/english'), 
                 readerControl=list(reader=readPlain, 
                                    language='en_CA',
                                    load=TRUE));

I figured I might hack the readDOC reader so that it would use docx2txt or antiword as needed, but this works.
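For what it's worth, the "use docx2txt or antiword as needed" idea could look roughly like the sketch below. It does not touch readDOC at all; it converts each file with whichever tool matches its extension and builds the corpus from the resulting character vector. The function names are invented for this example, and it assumes both tools are on the PATH and that your docx2txt build accepts `-` as the output file to print to stdout (check your installation, I have not verified this on every platform).

```r
# Pick an external converter from the file extension (case-insensitive).
converter_for <- function(path) {
  ext <- tolower(tools::file_ext(path))
  switch(ext,
         doc  = "antiword",
         docx = "docx2txt",
         stop("unsupported file type: ", path))
}

# Hypothetical helper: run the converter and capture the text it prints.
# Assumption: `docx2txt <file> -` writes the text to stdout.
word_to_text <- function(path) {
  tool <- converter_for(path)
  args <- if (tool == "docx2txt") c(shQuote(path), "-") else shQuote(path)
  paste(system2(tool, args, stdout = TRUE), collapse = "\n")
}

# Hypothetical helper: convert every .doc/.docx file in `dir` and wrap the
# texts in a tm corpus.
build_corpus <- function(dir, language = "en_CA") {
  files <- list.files(path.expand(dir), pattern = "\\.docx?$",
                      full.names = TRUE, ignore.case = TRUE)
  texts <- vapply(files, word_to_text, character(1))
  tm::VCorpus(tm::VectorSource(texts),
              readerControl = list(language = language))
}
```

Usage would be something like `ex_eng <- build_corpus('~/R/expertise/corpus/english')`. Going through VectorSource sidesteps the reader interface, at the cost of keeping every converted text in memory at once.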
