regex - Reading a table with spaces in one column

Question

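I am trying to extract tables from very large text files (computer logs). Dickoa gave very helpful advice on an earlier question on this topic here: extracting a table from a text file

I modified his suggestion to fit my specific problem and posted my code at the link above.

Unfortunately, I have run into a complication. One column in the table contains spaces. When I try to run the code at the link above, these spaces cause an error. Is there a way to modify that code, or specifically the read.table function, so that the second column below is recognized as a single column?

Here is a dummy table in a dummy log: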

> collect.models(, adjust = FALSE)
                                                                           model npar      AICc    DeltaAICc       weight  Deviance
5   AA(~region + state + county + city)BB(~region + state + county + city)CC(~1)   17  11111.11    0.0000000 5.621299e-01  22222.22
4                 AA(~region + state + county)BB(~region + state + county)CC(~1)   14  22222.22    0.0000000 5.621299e-01  77777.77
12                                  AA(~region + state)BB(~region + state)CC(~1)   13  33333.33    0.0000000 5.621299e-01  44444.44
12                                                  AA(~region)BB(~region)CC(~1)    6  44444.44    0.0000000 5.621299e-01  55555.55
> 
> # the three lines below count the number of errors in the code above

Here is the R code I am trying to use. This code works if there are no spaces in the second column (the model column):

my.data <- readLines('c:/users/mmiller21/simple R programs/dummy.log')

top    <- '> collect.models\\(, adjust = FALSE)'
bottom <- '> # the three lines below count the number of errors in the code above'

my.data  <- my.data[grep(top, my.data):grep(bottom, my.data)]

x <- read.table(text=my.data, comment.char = ">")

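I believe I have to use the variables top and bottom to locate the table in the log, because the log is huge, variable, and complex. Also, not every table contains the same number of models.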

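Perhaps a regular expression could somehow take advantage of the AA and CC(~1) that appear in every model name, but I do not know how to begin. Thank you for any help, and sorry for the follow-up question. I should have used a more realistic example table in my initial question. I have a large number of logs; otherwise I could simply extract and edit the tables by hand. The table itself is an odd object that I have only been able to export directly with capture.output, which would probably still leave me with the same problem as above.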

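EDIT:

All of the spaces seem to occur immediately before and after plus signs. Perhaps that information could be used here to fill the spaces or remove them.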
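The idea of anchoring on AA and CC(~1) could be sketched roughly as follows. This is only an illustration, not a tested solution for the real logs: the tbl_lines vector is a made-up stand-in for lines already cut out of the log, and it assumes every model name starts with "AA(" and ends with "CC(~1)", as in the dummy table above.

```r
# Made-up stand-in for table lines already extracted from the log.
tbl_lines <- c(
  "                                          model npar     AICc DeltaAICc    weight Deviance",
  "5  AA(~region + state)BB(~region + state)CC(~1)   13 33333.33         0 0.5621299 44444.44",
  "4                 AA(~region)BB(~region)CC(~1)    6 44444.44         0 0.5621299 55555.55"
)

# Locate the span from "AA(" through "CC(~1)" on each line.
# Lines without a match (such as the header) are left untouched.
m <- regexpr("AA\\(.*CC\\(~1\\)", tbl_lines)

# Remove every space inside the matched span only.
regmatches(tbl_lines, m) <- gsub(" ", "", regmatches(tbl_lines, m), fixed = TRUE)

# Each model name is now a single whitespace-free token.
x <- read.table(text = tbl_lines)
```

Unlike a substitution that targets only the plus signs, this removes any space inside the model name, so it would also cope with spaces that turn up elsewhere in a formula.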

score 1 · Accepted Answer

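Try inserting my.data <- gsub(" *\\+ *", "+", my.data) before read.table. At that point my.data is still a character vector of log lines rather than a data frame, so the substitution is applied to each whole line:

my.data  <- my.data[grep(top, my.data):grep(bottom, my.data)]

my.data <- gsub(" *\\+ *", "+", my.data)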

x <- read.table(text=my.data, comment.char = ">")
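For anyone who wants to try this without the original log, here is a minimal self-contained sketch. The log_lines vector is an invented stand-in for the output of readLines, and fixed = TRUE is my own substitution for the escaped pattern in the question, so treat both as assumptions. The row labels are distinct, since read.table uses the first column as row names when the header has one field fewer than the data rows.

```r
# Invented stand-in for readLines() on the real log file (dummy values).
log_lines <- c(
  "some earlier output",
  "> collect.models(, adjust = FALSE)",
  "                                          model npar     AICc DeltaAICc    weight Deviance",
  "5  AA(~region + state)BB(~region + state)CC(~1)   13 33333.33         0 0.5621299 44444.44",
  "4                 AA(~region)BB(~region)CC(~1)    6 44444.44         0 0.5621299 55555.55",
  "> ",
  "> # the three lines below count the number of errors in the code above",
  "some later output"
)

top    <- "> collect.models(, adjust = FALSE)"
bottom <- "> # the three lines below count the number of errors in the code above"

# fixed = TRUE matches the markers literally, so no regex escaping is needed.
first <- grep(top, log_lines, fixed = TRUE)
last  <- grep(bottom, log_lines, fixed = TRUE)
block <- log_lines[first:last]

# Collapse " + " to "+" so that each model name becomes a single token.
block <- gsub(" *\\+ *", "+", block)

# Lines starting with ">" are treated as comments and skipped.
x <- read.table(text = block, comment.char = ">")
```

After this, x should have the six named columns from the header, with the model names stored without spaces around the plus signs.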
