r - Improving the performance of updating a large data frame with the contents of a similar data frame

Question

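I am looking for a general solution for updating one large data frame with the contents of a second, similar data frame. I have dozens of datasets, each with thousands of rows and more than 10,000 columns. An "update" dataset will overlap its corresponding "base" dataset by anywhere from a few percent to about 50%, row-wise. The datasets have a "key" column, and within any given dataset there is only one row per unique key value.

The basic rule is: if a non-NA value exists in the update dataset for a given cell, replace the same cell in the base dataset with that value. ("Same cell" means the same value of the "key" column and the same column name.)

Note that the update dataset may contain new rows ("inserts"), which I can handle with rbind.

So, given a base data frame "df1", where column "K" is the unique key column and "P1".."P3" stand in for the 10,000 columns, whose names will vary from one pair of datasets to the next: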

  K P1 P2 P3
1 A  1  1  1
2 B  1  1  1
3 C  1  1  1

...and an update data frame "df2":

  K P1 P2 P3
1 B  2 NA  2
2 C NA  2  2
3 D  2  2  2

The result I need is as follows, where the 1s for "B" and "C" are overwritten by the 2s but not by the NAs:

  K P1 P2 P3
1 A  1  1  1
2 B  2  1  2
3 C  1  2  2
4 D  2  2  2

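This does not seem to be a candidate for merge, because merge gives me either duplicate rows (with respect to the "key" column) or duplicate columns (e.g. P1.x, P1.y), which I would have to iterate over to collapse somehow.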
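To make the duplicate-column problem concrete, here is a minimal sketch on the toy data from this question. The data frames are rebuilt inline with data.frame() purely for illustration, and the variable name m is arbitrary.

```r
# Toy data matching the example tables in the question.
df1 <- data.frame(K = c("A", "B", "C"), P1 = 1, P2 = 1, P3 = 1,
                  stringsAsFactors = FALSE)
df2 <- data.frame(K = c("B", "C", "D"),
                  P1 = c(2, NA, 2), P2 = c(NA, 2, 2), P3 = 2,
                  stringsAsFactors = FALSE)

# A full outer join on the key keeps one row per key, but every shared
# column comes back twice: suffixed .x (from df1) and .y (from df2).
m <- merge(df1, df2, by = "K", all = TRUE)
names(m)
nrow(m)
```

With 10,000 columns this would mean 20,000 value columns that still need to be collapsed pairwise.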

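I have tried pre-allocating a matrix with the dimensions of the final rows/columns, filling it with the contents of df1, and then iterating over the overlapping rows of df2, but I cannot get better than 20 cells per second, which would take hours to complete (compared with minutes for the equivalent DATA step UPDATE functionality in SAS).

I am sure I am missing something, but I cannot find a comparable example.

I have seen ddply usage that looks close, but not a general solution. The data.table package did not seem to help, because it is not obvious to me that this is a join problem, at least not in a general way over this many columns.

Also, a solution that handles only the intersecting rows would be sufficient, since I can identify the other rows and rbind them in.

Here is some code to build the data frames above: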

cat("K,P1,P2,P3", "A,1,1,1", "B,1,1,1", "C,1,1,1", file="f1.dat", sep="\n");
cat("K,P1,P2,P3", "B,2,,2", "C,,2,2", "D,2,2,2", file="f2.dat", sep="\n");
df1 <- read.table("f1.dat", sep=",", header=TRUE, stringsAsFactors=FALSE);
df2 <- read.table("f2.dat", sep=",", header=TRUE, stringsAsFactors=FALSE);

Thanks

score 2 · Accepted Answer

This may not be the fastest solution, but it is done entirely in base R.

(Answer updated per Tommy's comment)

#READING IN YOUR DATA FRAMES
df1 <- read.table(text="  K P1 P2 P3
1 A  1  1  1
2 B  1  1  1
3 C  1  1  1", header=TRUE)

df2 <- read.table(text="  K P1 P2 P3
1 B  2 NA  2
2 C NA  2  2
3 D  2  2  2", header=TRUE)

all <- c(as.character(df1$K), as.character(df2$K))      #all key values (works whether K is factor or character)
dups <- all[duplicated(all)]                            #keys present in both data frames
ndups <- all[!all %in% dups]                            #keys present in only one data frame
df3 <- rbind(df1[df1$K%in%ndups, ], df2[df2$K%in%ndups, ]) #bind the non-overlapping rows

decider <- function(x, y) ifelse(is.na(x), y, x) #take x unless it is NA, otherwise fall back to y
old <- df1[match(dups, df1$K), ]  #overlapping rows of df1, in the order of dups
new <- df2[match(dups, df2$K), ]  #overlapping rows of df2, aligned to the same order
df4 <- data.frame(K = dups, mapply(decider, new[-1], old[-1], SIMPLIFY = FALSE),
    stringsAsFactors = FALSE) #replace all NAs of df2 with df1 values; K (first column) is left out of decider

df5 <- rbind(df3, df4) #bind unique rows of df1 and df2 with NA replaced df4
df5 <- df5[order(df5$K), ]  #reorder based on key column
rownames(df5) <- 1:nrow(df5)  #give proper non duplicated rownames
df5

This yields:

  K P1 P2 P3
1 A  1  1  1
2 B  2  1  2
3 C  1  2  2
4 D  2  2  2

On closer reading, not all columns have the same names, but I am assuming they are in the same order. This may be a more useful approach:

all <- c(as.character(df1$K), as.character(df2$K))
dups <- all[duplicated(all)]
ndups <- all[!all %in% dups]
LS <- list(df1, df2)
LS2 <- lapply(seq_along(LS), function(i) {
        colnames(LS[[i]]) <- colnames(LS[[2]])  #force both frames to use df2's column names
        return(LS[[i]])
    }
)

LS3 <- lapply(seq_along(LS2), function(i) LS2[[i]][LS2[[i]]$K%in%ndups, ])       #rows whose key is in only one frame
LS4 <- lapply(seq_along(LS2), function(i) LS2[[i]][match(dups, LS2[[i]]$K), ])   #overlapping rows, aligned in the order of dups

decider <- function(x, y) ifelse(is.na(x), y, x)
vals <- setdiff(colnames(LS2[[2]]), "K")  #value columns only; the key is handled separately
DF <- data.frame(K = dups, mapply(decider, LS4[[2]][vals], LS4[[1]][vals], SIMPLIFY = FALSE),
    stringsAsFactors = FALSE)
LS3[[3]] <- DF
df5 <- do.call("rbind", LS3)
df5 <- df5[order(df5$K), ]
rownames(df5) <- 1:nrow(df5)
df5

score 2 · Accepted Answer

This loops by column, setting dt1 by reference, and (hopefully) should be fast.

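library(data.table)
dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
if (!identical(names(dt1),names(dt2)))
    stop("Assumed for now. Can relax later if needed.")
w = chmatch(dt2$K, dt1$K)               # row of dt1 for each key in dt2 (NA if the key is new)
for (i in 2:ncol(dt2)) {
    nna = !is.na(dt2[[i]]) & !is.na(w)  # non-NA updates whose key already exists in dt1
    set(dt1,w[nna],i,dt2[[i]][nna])     # overwrite those cells of dt1 by reference
}
dt1 = rbind(dt1,dt2[is.na(w)])          # append the rows whose key is new
dt1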
     K P1 P2 P3
[1,] A  1  1  1
[2,] B  2  1  2
[3,] C  1  2  2
[4,] D  2  2  2
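For readers without data.table, the same column-wise idea can be sketched in base R. This is only an illustration, not the answerer's code: it uses match() in place of chmatch() and plain sub-assignment in place of set(), so columns are modified with ordinary copy semantics rather than by reference. The names w, hit and out are made up for the example, and it assumes both frames share the same column names with a key column called K.

```r
# Toy data matching the example tables in the question.
df1 <- data.frame(K = c("A", "B", "C"), P1 = 1, P2 = 1, P3 = 1,
                  stringsAsFactors = FALSE)
df2 <- data.frame(K = c("B", "C", "D"),
                  P1 = c(2, NA, 2), P2 = c(NA, 2, 2), P3 = 2,
                  stringsAsFactors = FALSE)

w <- match(df2$K, df1$K)   # row of df1 for each row of df2; NA for new keys
out <- df1
for (nm in setdiff(names(df2), "K")) {
    # Rows of df2 whose key exists in df1 and whose value in this column is not NA.
    hit <- !is.na(w) & !is.na(df2[[nm]])
    out[[nm]][w[hit]] <- df2[[nm]][hit]
}
out <- rbind(out, df2[is.na(w), ])   # append rows with new keys ("inserts")
rownames(out) <- NULL
out
```

The loop runs once per column rather than once per cell, so the work inside each iteration is vectorized over rows.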

score 1 · Accepted Answer

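Edit: please ignore this answer. Looping by row was a bad idea. It works, but it is slow. Left for posterity! See my second attempt, posted as a separate answer.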

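require(data.table)
dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
setkey(dt1, K)                        # J() lookups need dt1 keyed by K
K = dt2[[1]]
for (i in 1:nrow(dt2)) {
    k = K[i]
    p = unlist(dt2[i,-1,with=FALSE])  # row i of dt2 as a named vector, without the key
    p = p[!is.na(p)]                  # keep only the non-NA updates
    if (length(p))                    # skip rows with nothing to update
        dt1[J(k),(names(p)):=as.list(p)]  # update the matching row; keys absent from dt1 match nothing
}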

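Alternatively, could you use a matrix instead of a data.frame? If so, it could be a one-liner using the A[B] syntax, where B is a 2-column matrix containing the row and column numbers to update.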
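A minimal sketch of that matrix idea, assuming all value columns are numeric and the keys are stored as row names. The names U, U2, common and idx are invented for the illustration; A and B follow the notation above.

```r
# Toy data from the question as numeric matrices, keys held in the row names.
cols <- c("P1", "P2", "P3")
A <- matrix(1, nrow = 3, ncol = 3, dimnames = list(c("A", "B", "C"), cols))
U <- matrix(c( 2, NA, 2,
              NA,  2, 2,
               2,  2, 2),
            nrow = 3, byrow = TRUE, dimnames = list(c("B", "C", "D"), cols))

# Restrict the update to keys that already exist in A.
common <- intersect(rownames(U), rownames(A))
U2 <- U[common, , drop = FALSE]

# (row, col) positions of the non-NA cells in U2.
idx <- which(!is.na(U2), arr.ind = TRUE)

# B: the same cells expressed as (row, col) positions in A.
B <- cbind(match(rownames(U2)[idx[, "row"]], rownames(A)),
           match(colnames(U2)[idx[, "col"]], colnames(A)))

A[B] <- U2[idx]   # the one-line update

# New keys are appended afterwards.
A <- rbind(A, U[setdiff(rownames(U), rownames(A)), , drop = FALSE])
A
```

Indexing a matrix with a 2-column numeric matrix selects one cell per row of the index, which is what lets the whole update happen in a single vectorized assignment.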

score 0 · Accepted Answer

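The following gives the correct answer for the small example data, tries to minimize the number of "copies" of the tables, and uses the new fread and (new?) rbindlist. Does it work for your larger real dataset? I did not fully follow all the comments in the original post about the memory problems you ran into when trying to flatten/normalize/stack, so apologies if you have already tried this route.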

library(data.table)
library(reshape2)

cat("K,P1,P2,P3", "A,1,1,1", "B,1,1,1", "C,1,1,1", file="f1.dat", sep="\n")
cat("K,P1,P2,P3", "B,2,,2", "C,,2,2", "D,2,2,2", file="f2.dat", sep="\n")

dt1s<-data.table(melt(fread("f1.dat"), id.vars="K"), key=c("K","variable")) # read f1.dat, melt to long/stacked format, and convert to data.table

dt2s<-data.table(melt(fread("f2.dat"), id.vars="K", na.rm=TRUE), key=c("K","variable")) # read f2.dat, melt to long/stacked format (removing NAs), and convert to data.table
setnames(dt2s,"value","value.new")

dt1s[dt2s,value:=value.new] # Update new values

dtout<-reshape(rbindlist(list(dt1s,dt1s[dt2s][is.na(value),list(K,variable,value=value.new)])), direction="wide", idvar="K", timevar="variable") # Use rbindlist to insert new records, and then reshape
setkey(dtout,K)
setnames(dtout,colnames(dtout),sub("value.", "", colnames(dtout))) # Clean up the column names
