java - Merging multiple .doc files into a single .csv

Question

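I think this is a fairly unusual problem, as I can't find an answer anywhere. I have about 100,000 Word documents (clinical report letters, so they are all free text, with commas, formatting and so on), all stored in the same folder. I would like to merge them into a single spreadsheet (ideally a .csv) so that each .doc occupies one row of the .csv.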

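To complicate matters, the first 6 characters of each .doc file name are the file's ID number (e.g. '123456report.doc'; the 'report' part can vary in length and characters, e.g. '123456John Smith report.doc' or '123457Jack Ryan Rep 01 01 2013.doc'). Originally the .docs were stored in individual folders that carried the ID number (it was actually a system of subfolders, where concatenating the folder names gave the ID number of the .doc, which I then managed to add to the file name). Let me know if this is useful and I can explain in more detail.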

So the final structure I need for the .csv is:

ID, Clinical report
123456, clinical text in document 123456report1.doc
123457, clinical text in document 123457report2.doc
123458, clinical text in document 123458report3.doc
...

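Note that the ID may be repeated in the data sheet (i.e. one patient may have several reports if they were examined more than once), and it is essential that the ID lets me cross-reference this sheet with other spreadsheets containing other data.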

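I'm not sure whether this is simple (I suspect not), but I don't know where to start. I'm not even sure which environment is best for this, so any hint would be much appreciated, even if that means buying software designed for this kind of task.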

Many thanks, Marco

score 0 · Accepted Answer

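Problem solved. Here is my script, which seems to work well on a subsample of the data. Many thanks to everyone. I also managed to extract the date from the file title (I left this out of the original question to avoid complicating it further, hence the extra few lines of code).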

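files     <- list.files(pattern = "\\.txt$")  # every file whose name ends in .txt
# substr() takes the first 7 characters of each file name (a leading space plus
# the 6-digit ID); trimws() then drops the space.
files.ID  <- trimws(substr(basename(files), 1, 7))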

# TO OBTAIN THE DATE FROM THE FILE TITLE
# Split each name on runs of non-digits. Every name starts with a space, so the
# first piece is empty, the second is the ID and the third is the date as DDMMYY,
# e.g. a hypothetical " 123456John Smith 010513.txt" gives "", "123456", "010513".
a <- unlist(strsplit(files, "[^0-9]+"))
b <- a[seq(3, length(a), 3)]  # every 3rd piece is a date; assumes exactly three pieces per name
d <- paste(substr(b, 1, 2), substr(b, 3, 4), substr(b, 5, 6), sep = "/")  # date as dd/mm/yy
files.date <- as.POSIXct(d, format = "%d/%m/%y")  # %y parses the two-digit year

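library(qdap)  # strip() comes from qdap

x <- length(files)
# Data frame with columns ID, date and text; date keeps its POSIXct class.
reports <- data.frame(ID = files.ID, date = files.date, text = character(x),
                      stringsAsFactors = FALSE)
for (i in seq_len(x)) {
  texto <- paste(readLines(files[i]), collapse = "\n ")
  # Keep the result of strip(): it drops every symbol not listed in char.keep
  # (with its defaults it also lower-cases the text and removes digits; see ?strip).
  texto <- strip(texto, char.keep = c(".","?","!","-","+","±","~","=","&","%","$","£","@","*","(",")",":",";",">","<"))
  reports$text[i] <- texto
}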
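
The script above leaves the result in `reports` without saving it. One possible last step, sketched here under the assumption that base R's write.csv() is good enough for the output: character fields are quoted by default, so the commas and line breaks in the letters should stay inside a single cell. The toy data frame and the temporary output path are made up for illustration.

```r
# Toy stand-in for the `reports` data frame (all values are invented).
reports <- data.frame(
  ID   = c("123456", "123456", "123457"),
  date = as.Date(c("2013-05-01", "2013-06-12", "2013-01-01")),
  text = c("Dear Dr Smith, the patient was reviewed today.",
           "Line one of the letter.\nLine two, after a line break.",
           "No abnormality detected; follow-up in 6 months."),
  stringsAsFactors = FALSE
)

# Hypothetical output path; a temporary file keeps the example self-contained.
out_file <- tempfile(fileext = ".csv")

# write.csv() quotes character fields by default, so embedded commas and
# newlines do not split a report across cells or rows.
write.csv(reports, out_file, row.names = FALSE)

# Read the file back with every column as text, so the IDs stay strings.
check <- read.csv(out_file, colClasses = "character")
```

Reading with `colClasses = "character"` keeps the ID as text, which may matter if it is later matched against IDs in other spreadsheets.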

score 0 · Accepted Answer

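In R you can use a loop to work through a directory full of files and, inside the loop, read each file with read.transcript from the qdap package and process it. qdap will also do some text analysis for you. The package's author is often on SO, and you may get a more complete answer from him. But reading up on qdap may be all you need to get off to a good start. Questions about writing the loop and the details of handling the files belong in a separate question (there are already plenty of those on SO, so you can probably find what you need by searching). Here is a simple loop structure to give you the idea: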

files <- list.files(pattern = "\\.(docx|DOCX)$")
files.noext <- tools::file_path_sans_ext(basename(files))  # file name without its extension
out.files <- paste0(files.noext, ".csv")

for (i in seq_along(files)) {
    # process the files here with qdap, accumulating the results into a new
    # structure to be determined; write out as csv
    # you might need two passes, one to unpack the docx, then one to assemble them
    # into a single structure for further analysis
}
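
The loop skeleton above could be filled in roughly as follows. This is only a sketch: `collect_reports` and its `read_fun` argument are hypothetical names, readLines() stands in for whatever actually reads a .docx (for example a qdap reader), and the ID is assumed to be the first six digits of the file name, as described in the question.

```r
# Collect one row per file. `read_fun` is a placeholder for the real reader
# and must return a character vector for a given file path.
collect_reports <- function(dir, pattern = "\\.txt$", read_fun = readLines) {
  files <- list.files(dir, pattern = pattern, full.names = TRUE)
  rows <- vector("list", length(files))
  for (i in seq_along(files)) {
    name <- basename(files[i])
    rows[[i]] <- data.frame(
      # First six digits of the name, ignoring any leading whitespace.
      # A name that does not match the pattern is returned unchanged.
      ID   = sub("^\\s*([0-9]{6}).*$", "\\1", name),
      text = paste(read_fun(files[i]), collapse = "\n"),
      stringsAsFactors = FALSE
    )
  }
  do.call(rbind, rows)  # one data frame, one row per file
}

# Usage on a throwaway directory with made-up file names and contents.
demo_dir <- tempfile("reports")
dir.create(demo_dir)
writeLines(c("Dear colleague,", "seen in clinic."),
           file.path(demo_dir, "123456report.txt"))
writeLines("Second letter, same patient.",
           file.path(demo_dir, "123456John Smith report.txt"))
writeLines("Another patient.",
           file.path(demo_dir, "123457Jack Ryan Rep.txt"))

reports <- collect_reports(demo_dir)
write.csv(reports, file.path(demo_dir, "reports.csv"), row.names = FALSE)
```

Building a list of one-row data frames and binding them once at the end avoids growing a data frame inside the loop, which could become slow with around 100,000 files.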
