2

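I have software that produces experimental data with a limited width: a long run of data points gets wrapped across a series of rows, limited to 4 columns wide in the final csv, instead of a single row per variable (A and B below), which is the form I need. (Example csv below.)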

A,1,3,3,2
,5,6,7,8
,9,10,11,12
,13,1,15,6
,17,1,2,20
B,1,2,3,7
,7,6,7,8
,9,10,11,12
,13,15,15,16
,17,18,3,2

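In the real data this leaves me with about 53,000 rows to process per day, so I am wondering whether there is a function that would let me unwrap or re-divide a given subset of the data (per variable) into a single row. In the example above, the numbers following variable A would be combined into one row while keeping their order (i.e. 1, 3, 3, 2, 5...), likewise for B, and so on.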

As requested, dput output generating the simplified example above.

 structure(list(V1 = structure(c(2L, 1L, 1L, 1L, 1L, 3L), .Label = c("", 
 "A", "B"), class = "factor"), V2 = c(1L, 5L, 9L, 13L, 17L, 1L
 ), V3 = c(2L, 6L, 10L, 14L, 18L, 2L), V4 = c(3L, 7L, 11L, 15L, 
 19L, 3L), V5 = c(4L, 8L, 12L, 16L, 20L, 4L)), .Names = c("V1", 
 "V2", "V3", "V4", "V5"), row.names = c(NA, 6L), class = "data.frame")

5 Answers

3

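You can use an external tool to preprocess the file:

read.csv(pipe("sed -e :a -e '$!N;s/\\n,/,/;ta' -e 'P;D' file.txt"), head=FALSE)

Essentially, file.txt is first processed by the Unix tool sed, which performs a search and replace and returns the new content to R. I adapted a regular expression from this page to perform the following task:

  If a line begins with a comma, append it to the previous line
  by deleting the newline in front of the comma

Edit (eddi - note: this does not seem to work on Mac OS): here is how the following sed command is parsed: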

read.csv(pipe("sed ':a; N; s/\\n,/,/; t a; P; D' file.txt"), head=FALSE)

:a       # label (named "a") we're going to come back to
N        # read in the next line into pattern space, together with the newline character
s/\n,/,/ # if there is a newline followed by comma, delete the newline
t a      # go back to "a" and repeat until the above match fails (t stands for test)
P        # print everything in pattern space up to and including the first \n
D        # delete everything in pattern space up to and including the first \n, then restart the cycle
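
Where sed is unavailable or behaves differently (as the Mac OS note above suggests), the same joining logic can be sketched in base R. This is only a minimal sketch under stated assumptions, not the method of this answer: `lines` stands in for `readLines("file.txt")` and holds a shortened copy of the example csv, and `group`, `joined` and `result` are illustrative names.

```r
# Stand-in for readLines("file.txt"): a shortened copy of the example csv.
lines <- c("A,1,3,3,2", ",5,6,7,8", "B,1,2,3,7", ",7,6,7,8")

# A line that does not start with a comma opens a new record, so the
# running count of such lines gives every line its record number.
group <- cumsum(!grepl("^,", lines))

# Concatenate the lines of each record. Continuation lines already begin
# with a comma, so no extra separator is needed.
joined <- vapply(split(lines, group), paste, character(1), collapse = "")

# Parse the joined records as csv, one record per row.
result <- read.csv(text = joined, header = FALSE)
print(result)
```

Grouping with `cumsum()` plays the role of the `:a ... t a` loop: every continuation line inherits the record number of the last line that did not start with a comma.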
answered 2013-08-08T17:48:11.163
2

grep, paste and read.table come in very handy here.

# read in your data raw, one string per line
X <- readLines("file")

# Prepend a line break to any line that does NOT start with a comma,
# then re-read with read.table (the leading blank line is skipped)
read.table(text=paste(ifelse(grepl("^,", X), X, paste0("\n", X)), collapse=""), sep=",")

Yields:

  V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21  
1  A  1  3  3  2  5  6  7  8   9  10  11  12  13   1  15   6  17   1   2  20  
2  B  1  2  3  7  7  6  7  8   9  10  11  12  13  15  15  16  17  18   3   2
answered 2013-08-08T18:45:08.010
2

Here is another base R solution. It uses gsub() and is short and readable (at least to me).

txt = readLines("file.txt")

# Join into one long string with newlines.
txt_long = paste(txt, collapse="\n")

# Remove newlines directly preceding a comma.
newtxt = gsub("\\n,", ",", txt_long)

read.table(text=newtxt, sep=",")
#   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
# 1  A  1  3  3  2  5  6  7  8   9  10  11  12  13   1  15   6  17   1   2  20
# 2  B  1  2  3  7  7  6  7  8   9  10  11  12  13  15  15  16  17  18   3   2
answered 2013-08-08T22:40:54.193
1

This is a bit ugly, but it is the first general strategy that came to mind:

library(zoo)
library(plyr)
dat$V1 <- na.locf(dat$V1)
> ddply(dat,.(V1),function(x) c(t(as.matrix(x[,-1]))))
  V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
1  1  3  3  2  5  6  7  8  9  10  11  12  13   1  15   6  17   1   2  20
2  1  2  3  7  7  6  7  8  9  10  11  12  13  15  15  16  17  18   3   2

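This assumes you read the data into an object called dat and used na.strings = "". You can add the A, B variable information back afterwards, or perhaps fold it into the anonymous function passed to ddply.

There may be a way to reshape this directly with dcast, but I could not think of one.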
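
For readers without zoo or plyr, the two steps above (carry the label forward, then flatten each block row by row) can be sketched in base R. This is a minimal sketch under assumptions, not the code of this answer: `dat` is built from a shortened copy of the example csv, `is_start`, `label` and `wide` are illustrative names, and every variable is assumed to span the same number of rows so that `sapply()` can simplify to a matrix.

```r
# Illustrative data in the shape of the example csv, read with
# na.strings = "" so that continuation rows have NA in V1.
dat <- read.csv(text = "A,1,3,3,2\n,5,6,7,8\nB,1,2,3,7\n,7,6,7,8",
                header = FALSE, na.strings = "")

# Carry the last non-missing label forward (the job na.locf does above).
is_start <- !is.na(dat$V1)
label <- dat$V1[is_start][cumsum(is_start)]

# For each label, transpose its block so that c() reads it row by row,
# then stack the flattened blocks as rows of one matrix.
wide <- t(sapply(split(dat[, -1], label),
                 function(x) c(t(as.matrix(x)))))
print(wide)
```

Unlike the ddply output, the result here is a numeric matrix whose row names carry the A and B labels.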

answered 2013-08-08T17:30:11.513
1

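Don't you just love instrument manufacturers?

Here is one approach. I don't think it is perfect, since I can't fully test it without all the data, but you can.

Edit: updated the function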

cleanData <- function(df) {
    good <- c() # holds indices of lines that start a row in the final data set
        # Find the 'starter' rows
    for (n in 1:nrow(df)) {
        if (df[n,1] != "") good <- c(good,n)
        }

    # Now go back and put it back together
    # Get one row in 1st to set dimensions

    newDat <- df[good[1]:(good[2] - 1), ]  # rows of the first block only
    offset <- nrow(newDat)-1
    data <- as.numeric(t(as.matrix(newDat[,-1])))
    label <- df[good[1],1]
    newDat <- data.frame(data)
    names(newDat) <- label
    #print(newDat) # OK

    # now do them all
    for (n in 2:length(good)) {
        use <- good[n]:(good[n] + offset)
        data <- as.numeric(t(as.matrix(df[use,-1])))
        label <- df[good[n],1]
        newCol <- data.frame(data)
        names(newCol) <- label
        newDat <- cbind(newDat, newCol)
        }

    newDat
    }

Copy and paste the function above into R, then run newTst <- cleanData(tst), where tst is your data frame from read.csv. If it works, look at newTst or run str(newTst).

With your test data, it gives:

'data.frame':   20 obs. of  2 variables:
 $ A: num  1 2 3 4 5 6 7 8 9 10 ...
 $ B: num  1 2 3 4 NA NA NA NA NA NA ...
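
The NA values in column B appear to arise because the function takes the block length from the first variable, while in the dput test data B spans only one row. A minimal sketch of a variant that tolerates blocks of different lengths by returning a list instead of a data frame follows. It is not the function above: `tst` is rebuilt from the question's dput output (with character labels instead of a factor), and `starts`, `block` and `unwrapped` are illustrative names.

```r
# The question's dput data rebuilt by hand: A spans five rows, B only one.
tst <- data.frame(V1 = c("A", "", "", "", "", "B"),
                  V2 = c(1L, 5L, 9L, 13L, 17L, 1L),
                  V3 = c(2L, 6L, 10L, 14L, 18L, 2L),
                  V4 = c(3L, 7L, 11L, 15L, 19L, 3L),
                  V5 = c(4L, 8L, 12L, 16L, 20L, 4L),
                  stringsAsFactors = FALSE)

# Rows with a non-empty label start a block; cumsum numbers the blocks.
starts <- tst$V1 != ""
block <- cumsum(starts)

# Flatten each block row by row; blocks may differ in length.
unwrapped <- lapply(split(tst[, -1], block),
                    function(x) as.numeric(t(as.matrix(x))))
names(unwrapped) <- tst$V1[starts]
str(unwrapped)
```

A list is used because vectors of unequal length cannot be bound into a data frame without padding.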
answered 2013-08-08T18:20:13.403