I have two data.tables in the following format (the real tables each have about a million rows):
library(data.table)
dt1 <- data.table(
code=c("A001", "A002","A003","A004","A005"),
x=c(65,92,25,450,12),
y=c(98,506,72,76,15),
no1=c(010101, 010156, 028756, 372576,367383),
no2=c(876362,"",682973,78269,"")
)
dt2 <- data.table(
code=c("A003", "A004","A005","A006","A007","A008","A009"),
x=c(25,126,12,55,34,134,55),
y=c(72,76,890,568,129,675,989),
no1=c(028756, 372576,367383,234876, 287156, 123348, 198337),
no2=c(682973,78269,65378,"","","",789165)
)
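I want to combine the two and keep only the unique rows, where uniqueness is judged across all columns. This is what I have so far, but I suspect there is a better way: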
dt3 <- rbindlist(list(dt1, dt2))
dt3 <- unique(dt3, by = c("code", "x", "y", "no1", "no2"))
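A slightly shorter form of the same step, as a minimal sketch: it assumes data.table 1.9.8 or later, where unique() on a data.table compares all columns when by is not supplied. The tables a and b are small hypothetical stand-ins, not my real data.

```r
library(data.table)

# Hypothetical stand-in tables that share one identical row ("A003")
a <- data.table(code = c("A001", "A003"), x = c(65, 25), y = c(98, 72))
b <- data.table(code = c("A003", "A004"), x = c(25, 126), y = c(72, 76))

# Stack the tables, then drop rows that are identical in every column.
# Leaving out `by` makes unique() use all columns (data.table >= 1.9.8).
combined <- unique(rbindlist(list(a, b)))
```

This only drops the explicit by argument; I have not measured whether it makes any difference in speed at a million rows.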
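Once I have this single dataset, I want to attach some attribute information to any records whose code appears more than once: a version number, and a comment describing how that version differs from the previous one. The output I am looking for is this: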
dt4 <- data.table(
code=c("A001", "A002","A003","A004","A005", "A004","A005","A006","A007","A008","A009"),
x=c(65,92,25,450,12,126,12,55,34,134,55),
y=c(98,506,72,76,15,76,890,568,129,675,989),
no1=c(010101, 010156, 028756, 372576,367383, 372576,367383,234876, 287156, 123348, 198337),
no2=c(876362,"",682973,78269,"",78269,65378,"","","",789165),
version = c("V1","V1","V1","V1","V1","V2","V2","V1","V1","V1","V1"),
unique_version=c("A001_V1", "A002_V1","A003_V1","A004_V1","A005_V1", "A004_V2","A005_V2","A006_V1","A007_V1","A008_V1","A009_V1"),
comment = c("First_entry","First_entry","First_entry","First_entry","First_entry","New_x", "New_y_and_no2","First_entry","First_entry","First_entry","First_entry")
)
I'm not sure how to get to dt4 in an efficient way, given that the real dataset will have more than a million rows.
Edit
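After applying @Chase's solution to my real data, I noticed that my dt3 example differs slightly from the kind of result I actually get. This looks more like my real data: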
dt6 <- data.table(
code=c("A111", "A111","A111","A111","A111", "A111","A111","A234", "A234","A234","A234","A234", "A234","A234"),
x=c("",126,126,"",836,843,843,126,126,"",127,836,843,843),
y=c("",76,76,"",456,465,465,76,76,"",77,456,465,465),
no1=c(028756, 028756,028756,057756, 057756, 057756, 057756,028756, 028756,057756,057756, 057756, 057756, 057756),
no2=c("","",034756,"","","",789165,"",034756,"","","","",789165)
)
comp_cols <- c("x", "y", "no1", "no2")
#grabs the names of the mismatching values and formats them how you did
f <- function(x,y) {
n_x <- names(x)
diff <- x != y
paste0("New_", paste0(n_x[diff], collapse = "_and_"))
}
dt6[, version := paste0("V", 1:.N), by = code]
dt6[, unique_version := paste(code, version, sep = "_")]
dt6[, comment := ifelse(version == "V1", "First_entry", f(.SD[1], .SD[2])), by = code, .SDcols = comp_cols]
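As you can see, the suggested solution for the comment column only seems to return the change between the first and second versions of each code, and repeats it for every later version. What I would prefer is the change between each version and the one before it (V2 vs. V1, V3 vs. V2, and so on).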
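To make the behaviour I am after concrete, here is a rough sketch that compares each row with the previous row of the same code using shift(). The table d, the vector cmp and the chg_ helper columns are hypothetical stand-ins, and I have not tried this on the full data, so treat it as an illustration of the desired output rather than a vetted solution.

```r
library(data.table)

# Hypothetical stand-in: three versions of code "A1" (not stored contiguously)
# and a single version of code "B1"
d <- data.table(
  code = c("A1", "B1", "A1", "A1"),
  x    = c("1",  "5",  "2",  "3"),
  y    = c("7",  "5",  "7",  "8")
)
cmp <- c("x", "y")  # columns to compare between versions

d[, version := paste0("V", seq_len(.N)), by = code]
d[, unique_version := paste(code, version, sep = "_")]

# One logical helper column per compared column: TRUE where the value differs
# from the previous row of the same code, NA on the first row of each code
chg <- paste0("chg_", cmp)
d[, (chg) := lapply(.SD, function(v) v != shift(v)), by = code, .SDcols = cmp]

# Collect the names of the changed columns, joined with "_and_".
# NA flags (first row of a code, or NA values) are treated as "no change".
changes <- rep("", nrow(d))
for (nm in cmp) {
  changed <- d[[paste0("chg_", nm)]] %in% TRUE
  sep <- ifelse(changes == "", "", "_and_")
  changes <- ifelse(changed, paste0(changes, sep, nm), changes)
}

# A later version with no change in any compared column would get a bare "New_"
d[, comment := ifelse(version == "V1", "First_entry", paste0("New_", changes))]
d[, (chg) := NULL]  # drop the helper columns
```

On this stand-in, A1_V2 should get New_x (only x changed relative to V1) and A1_V3 should get New_x_and_y (both changed relative to V2), which is the per-version comparison I am looking for.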