
I have a sample dataset like this:

 8  02-Model (Minimum)
250.04167175293  17.4996566772461
250.08332824707  17.5000038146973
250.125  17.5008907318115
250.16667175293  17.5011672973633
250.20832824707  17.5013771057129
250.25   17.502140045166
250.29167175293  17.5025615692139
250.33332824707  17.5016822814941
 7  03 (Maximum)
250.04167175293  17.5020561218262
250.08332824707  17.501148223877
250.125  17.501127243042
250.16667175293  17.5012378692627
250.20832824707  17.5016021728516
250.25   17.5024681091309
250.29167175293  17.5043239593506

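The first field of each header line gives the number of data rows that follow for that series (e.g. 8 for 02-Model (Minimum)). After those 8 rows there is another header line, 7 03 (Maximum), which means that 03 (Maximum) has 7 rows of data.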

The function I wrote is as follows:

readts <- function(x)
{
  path <- x
  # Read the first line of the file
  hello1 <- read.table(path, header = F, nrows = 1,sep="\t")
  tmp1 <- hello1$V1
  # Read the data below first line
  hello2 <- read.table(path, header = F, nrows = (tmp1), skip = 1, 
                       col.names = c("Time", "value"))
  hello2$name <- c(as.character(hello1$V2))
  # Read data for the second chunk
  hello3 <- read.table(path, header = F, skip = (tmp1 + 1), 
                       nrows = 1,sep="\t")
  tmp2 <- hello3$V1
  hello4 <- read.table(path, header = F, skip = (tmp1 + 2), 
                       col.names = c("Time", "value"),nrows=tmp2)
  hello4$name <- c(as.character(hello3$V2))
  # Combine data to create a dataframe
  df <- rbind(hello2, hello4)
  return(df)
}

The output I get is as follows:

> readts("jdtrial.txt")
       Time    value               name
1  250.0417 17.49966 02-Model (Minimum)
2  250.0833 17.50000 02-Model (Minimum)
3  250.1250 17.50089 02-Model (Minimum)
4  250.1667 17.50117 02-Model (Minimum)
5  250.2083 17.50138 02-Model (Minimum)
6  250.2500 17.50214 02-Model (Minimum)
7  250.2917 17.50256 02-Model (Minimum)
8  250.3333 17.50168 02-Model (Minimum)
9  250.0417 17.50206       03 (Maximum)
10 250.0833 17.50115       03 (Maximum)
11 250.1250 17.50113       03 (Maximum)
12 250.1667 17.50124       03 (Maximum)
13 250.2083 17.50160       03 (Maximum)
14 250.2500 17.50247       03 (Maximum)
15 250.2917 17.50432       03 (Maximum)

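jdtrial.txt is the data shown above. However, when I have large data with multiple separators, my function does not work: I would need to add more lines for every extra chunk, which makes the function even messier. Is there a simpler way to read a data file like this? Thanks.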

The expected output has the same form as the output I get above. Here is data you can try:

 8  02-Model (Minimum)
250.04167175293  17.4996566772461
250.08332824707  17.5000038146973
250.125  17.5008907318115
250.16667175293  17.5011672973633
250.20832824707  17.5013771057129
250.25   17.502140045166
250.29167175293  17.5025615692139
250.33332824707  17.5016822814941
 7  03 (Maximum)
250.04167175293  17.5020561218262
250.08332824707  17.501148223877
250.125  17.501127243042
250.16667175293  17.5012378692627
250.20832824707  17.5016021728516
250.25   17.5024681091309
250.29167175293  17.5043239593506
 8  04-Model (Maximum)
250.04167175293  17.5020561218262
250.08332824707  17.501148223877
250.125  17.501127243042
250.16667175293  17.5012378692627
250.20832824707  17.5016021728516
250.25   17.5024681091309
250.29167175293  17.5043239593506
250.33332824707  17.5055828094482

4 Answers

It is not clear what "multiple separators" refers to, but here is a solution for the data you actually show.

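Read in the data using fill=TRUE so that blank fields are filled out. Keep track of which rows are headers using is.hdr. Convert V2 to numeric (first replacing V2 with NA in the header rows so that they do not generate warnings). Then, in the next two columns, replace the non-header rows with NA and use na.locf from the zoo package to fill those NAs with the header values. Finally, keep only the non-header rows.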

library(zoo)
DF <- read.table("jdtrial.txt", fill = TRUE, as.is = TRUE)

is.hdr <- DF$V3 != ""
transform(DF, 
    V2 = as.numeric(replace(V2, is.hdr, NA)),
    V3 = na.locf(ifelse(is.hdr, V2, NA)),
    name = na.locf(ifelse(is.hdr, V3, NA)))[!is.hdr, ]

The result of the last statement is:

         V1       V2       V3      name
2  250.0417 17.49966 02-Model (Minimum)
3  250.0833 17.50000 02-Model (Minimum)
4  250.1250 17.50089 02-Model (Minimum)
5  250.1667 17.50117 02-Model (Minimum)
6  250.2083 17.50138 02-Model (Minimum)
7  250.2500 17.50214 02-Model (Minimum)
8  250.2917 17.50256 02-Model (Minimum)
9  250.3333 17.50168 02-Model (Minimum)
11 250.0417 17.50206       03 (Maximum)
12 250.0833 17.50115       03 (Maximum)
13 250.1250 17.50113       03 (Maximum)
14 250.1667 17.50124       03 (Maximum)
15 250.2083 17.50160       03 (Maximum)
16 250.2500 17.50247       03 (Maximum)
17 250.2917 17.50432       03 (Maximum)
19 250.0417 17.50206 04-Model (Maximum)
20 250.0833 17.50115 04-Model (Maximum)
21 250.1250 17.50113 04-Model (Maximum)
22 250.1667 17.50124 04-Model (Maximum)
23 250.2083 17.50160 04-Model (Maximum)
24 250.2500 17.50247 04-Model (Maximum)
25 250.2917 17.50432 04-Model (Maximum)
26 250.3333 17.50558 04-Model (Maximum)
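
The carry-forward step is the heart of this approach. As a minimal sketch, assuming the zoo package is not available, the same fill can be done in base R. The helper name locf is hypothetical (it is not part of the answer above), and it assumes the first element is never NA, which holds here because the file starts with a header line.

```r
# Hypothetical base-R stand-in for zoo::na.locf:
# carry the last non-NA value forward.
# Assumption: x[1] is not NA (the file begins with a header row).
locf <- function(x) {
  filled <- !is.na(x)
  # cumsum(filled) counts how many non-NA values have been seen so far,
  # which is the index of the most recent one within x[filled]
  x[filled][cumsum(filled)]
}

hdr <- c("02-Model", NA, NA, "03", NA)
filled_hdr <- locf(hdr)
print(filled_hdr)
# "02-Model" "02-Model" "02-Model" "03" "03"
```

With such a helper, the two na.locf calls could in principle be swapped for locf, although na.locf handles more edge cases (such as leading NAs).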
answered 2013-07-12T03:08:05.550

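Read the data in with readLines, then process each chunk in turn. This avoids having to make assumptions about the model names or fiddle with regular expressions. You do have to use a loop rather than [sl]apply, but really, there is nothing wrong with that.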

readFile <- function(file)
{
    con <- readLines(file)
    i <- 1
    chunks <- list()
    while(i < length(con))
    {
        type <- scan(text=con[i], what=character(2), sep="\t")
        nlines <- as.numeric(type[1])
        dat <- cbind(read.delim(text=con[i+seq_len(nlines)], header=FALSE),
                     type=type[2])
        chunks <- c(chunks, list(dat))
        i <- i + nlines + 1
    }
    do.call(rbind, chunks)
}
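
The function above splits the header line on a tab. If the header fields are instead separated by runs of spaces, as they appear in the data as displayed, a variant along the following lines may work. This is only a sketch under that assumption: the name read_chunks is hypothetical, and it takes the lines as a character vector (e.g. from readLines) rather than a file name so that it can be tried without a file.

```r
# Hypothetical variant: chunk-wise reading with whitespace-separated headers.
# Assumes each header is "<count> <name>" and is followed by exactly <count>
# rows holding two numeric columns.
read_chunks <- function(lines) {
  chunks <- list()
  i <- 1
  while (i <= length(lines)) {
    hdr  <- trimws(lines[i])
    n    <- as.integer(sub("\\s.*$", "", hdr))  # leading row count
    name <- sub("^\\S+\\s+", "", hdr)           # everything after the count
    dat  <- read.table(text = lines[i + seq_len(n)],
                       col.names = c("Time", "value"))
    dat$name <- name
    chunks[[length(chunks) + 1]] <- dat
    i <- i + n + 1                              # jump to the next header
  }
  do.call(rbind, chunks)
}

# A shortened excerpt of the question's data, with the counts adjusted to match
sample_lines <- c(" 2  02-Model (Minimum)",
                  "250.04167175293  17.4996566772461",
                  "250.08332824707  17.5000038146973",
                  " 1  03 (Maximum)",
                  "250.04167175293  17.5020561218262")
res <- read_chunks(sample_lines)
print(res)
```

For a real file, the call would be read_chunks(readLines("jdtrial.txt")). The sketch does not guard against a count of zero or a count that overruns the file.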
answered 2013-07-12T02:31:29.253

Edit to replace my original answer in light of @G.Grothendieck's far better answer. This is largely a variation on that answer.

Another go, where for the purposes of demonstration, test is just the raw text like:

test <-" 1  02-Model (Minimum)
250.04167175293  17.4996566772461
 1  03 (Maximum)
250.04167175293  17.5020561218262
 1  04-Model (Maximum)
250.04167175293  17.5020561218262"

Process it:

interm <- read.table(
  text = test, fill = TRUE, as.is = TRUE,
  col.names=c("Time","Value","Name")
)

keys <- which(interm$Name != "")

interm$Name <- rep(
  apply(interm[keys,][-1],1,paste0,collapse=""), 
  diff(c(keys,nrow(interm)+1))
)

result <- interm[-(keys),]

Result:

      Time            Value              Name
2 250.0417 17.4996566772461 02-Model(Minimum)
4 250.0417 17.5020561218262       03(Maximum)
6 250.0417 17.5020561218262 04-Model(Maximum)
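
The rep/diff step may be easier to follow with concrete numbers. As an illustration, assuming the full three-series file from the question (26 lines, with headers on lines 1, 10 and 18), the run lengths come out as follows. The variable names here are for illustration only.

```r
# Header positions and total row count for the question's 26-line sample
keys   <- c(1, 10, 18)
n_rows <- 26

# Distance from each header to the next one (or to one past the end):
# each run covers the header row itself plus its data rows
run_lengths <- diff(c(keys, n_rows + 1))
print(run_lengths)
# 9 8 9, i.e. 1 + 8, 1 + 7 and 1 + 8 rows

# Repeating each label over its run gives one label per row of the table
labels <- rep(c("A", "B", "C"), run_lengths)
print(length(labels))
# 26
```

The header rows receive a label too, which is harmless because they are dropped afterwards by interm[-(keys), ].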
answered 2013-07-12T02:35:22.787

Here is a function that seems to work with your sample data. It returns a list of data.frames, but you can use do.call(rbind, ...) to get a single data.frame if you prefer.

myFun <- function(textfile) {
  # Read the lines of your text file
  x <- readLines(textfile)
  # Identify lines that start with space followed
  #  by numbers followed by space followed by
  #  numbers. By the looks of it, matching the
  #  space at the start of the line might be
  #  sufficient at this stage.
  myMatch <- grep("^\\s[0-9]+\\s+[0-9]+", x)
  # Extract the first number, which tells us how
  #  many values need to be read in.
  scanVals <- as.numeric(gsub("^\\s+([0-9]+)\\s+.*", 
                              "\\1", x[myMatch]))
  # Extract. I've used seq_along which is like 
  #  1:length(myMatch)
  temp <- lapply(seq_along(myMatch), function(y) {
    # scan will return just a single vector, but your
    #  data are in pairs, so we convert the vector to
    #  a matrix filled in by row
    t1 <- matrix(scan(textfile, skip = myMatch[y], 
                      n = scanVals[y]*2), ncol = 2, 
                 byrow = TRUE)
    # Add column names to the matrix
    colnames(t1) <- c("time", "value")
    # Convert the matrix to a data.frame and add the 
    #  name column using cbind.
    cbind(data.frame(t1), 
          name = gsub("^\\s+([0-9]+)\\s+(.*)", "\\2", 
                      x[myMatch])[y])
  })
  # Return the list we just created
  temp
}

Example usage:

myFun("mytest.txt")                  ## list output

Or:

do.call(rbind, myFun("mytest.txt"))  ## Single data.frame
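
To see what the three regular expressions in myFun do, they can be tried on individual lines. The snippet below is a small sketch using one header and one data line from the question. The last line tested is a hypothetical header whose name starts with a letter, included to show that the header pattern relies on names beginning with a digit.

```r
hdr <- " 8  02-Model (Minimum)"
dat <- "250.04167175293  17.4996566772461"
pat <- "^\\s[0-9]+\\s+[0-9]+"

# Only the header line starts with whitespace followed by two numbers
is_header <- grepl(pat, c(hdr, dat))
print(is_header)
# TRUE FALSE

# First capture group: the row count
n_rows <- as.numeric(gsub("^\\s+([0-9]+)\\s+.*", "\\1", hdr))
print(n_rows)
# 8

# Second capture group: the series name
label <- gsub("^\\s+([0-9]+)\\s+(.*)", "\\2", hdr)
print(label)
# "02-Model (Minimum)"

# Hypothetical header with a name that does not start with a digit
letter_hdr_matches <- grepl(pat, " 5  Model-05 (Mean)")
print(letter_hdr_matches)
# FALSE
```

If names in the real file can start with a letter, matching only the leading whitespace and count (as the comment in myFun suggests) may be the safer test.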
answered 2013-07-12T02:09:51.557