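As I mentioned in my comment, if your data are balanced (that is, you expect a nice rectangular dataset after splitting your data), you should take a look at my concat.split.DT function.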
Here are some tests.
Sven's data, but with 20K rows instead of 2:
dat <- do.call(rbind, replicate(1e4, dat, simplify=FALSE))
dim(dat)
# [1] 20000 1
The "stringr" functions can be a bit slow:
library(stringr)
system.time(do.call(rbind, str_split(dat$a, "/")))
# user system elapsed
# 3.194 0.000 3.211
But how do the other solutions fare?
fun1 <- function() concat.split.multiple(dat, "a", "/")
fun2 <- function() do.call(rbind, strsplit(dat$a, "/", fixed=TRUE))
## ^^ fixed = TRUE will make a big difference
fun3 <- function() concat.split.DT(dat, "a", "/")
library(microbenchmark)
microbenchmark(fun1(), fun2(), fun3(), times = 10)
# Unit: milliseconds
# expr min lq median uq max neval
# fun1() 530.46597 534.13486 535.19139 538.91488 553.61919 10
# fun2() 30.22265 31.07287 31.81474 32.93936 40.28859 10
# fun3() 22.57517 22.94169 23.10297 23.30907 31.97640 10
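So, that's about half a second for the regular concat.split.multiple (which just uses read.table under the hood), and much better results for strsplit and concat.split.DT (the latter of which uses fread from "data.table" under the hood).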
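To illustrate why reading the strings with fread can be quick, here is a minimal sketch of the general idea: hand the concatenated strings to fread as if they were lines of a delimited file. This is not the actual concat.split.DT source. The helper name split_with_fread is hypothetical, and the sketch assumes a "data.table" version whose fread accepts the text argument.

```r
library(data.table)

# Hypothetical helper (NOT the real concat.split.DT implementation):
# treat each string as one line of a delimited file and let fread parse it.
# Assumes balanced data, i.e. every string has the same number of pieces.
split_with_fread <- function(x, sep) {
  fread(text = as.character(x), sep = sep, header = FALSE)
}

# Toy data shaped like the example used in this answer
toy <- data.frame(a = c("a/b/c/d", "e/f/g/h"), stringsAsFactors = FALSE)
res <- split_with_fread(toy$a, "/")
dim(res)
```

The result is a data.table with one column per piece (V1, V2, and so on), which is why this approach only suits balanced data.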
Let's scale up further, now to 1 million rows...
dat <- do.call(rbind, replicate(50, dat, simplify=FALSE))
dim(dat)
# [1] 1000000 1
microbenchmark(fun2(), fun3(), times = 5)
# Unit: seconds
# expr min lq median uq max neval
# fun2() 6.257892 6.522199 13.728283 13.934860 14.277432 5
# fun3() 1.671739 1.830485 2.203076 2.470872 2.572917 5
An advantage of the concat.split.DT approach is that it makes splitting multiple columns convenient with one simple syntax:
dat2 <- do.call(cbind, replicate(5, dat, simplify = FALSE))
dim(dat2)
# [1] 1000000 5
names(dat2) <- make.unique(names(dat2))
head(dat2)
# a a.1 a.2 a.3 a.4
# 1 a/b/c/d a/b/c/d a/b/c/d a/b/c/d a/b/c/d
# 2 e/f/g/h e/f/g/h e/f/g/h e/f/g/h e/f/g/h
# 3 a/b/c/d a/b/c/d a/b/c/d a/b/c/d a/b/c/d
# 4 e/f/g/h e/f/g/h e/f/g/h e/f/g/h e/f/g/h
# 5 a/b/c/d a/b/c/d a/b/c/d a/b/c/d a/b/c/d
# 6 e/f/g/h e/f/g/h e/f/g/h e/f/g/h e/f/g/h
Now, let's split them all in one go:
system.time(out <- concat.split.DT(dat2, names(dat2), "/"))
# user system elapsed
# 6.260 0.040 6.532
out
# a_1 a_2 a_3 a_4 a.1_1 a.1_2 a.1_3 a.1_4 a.2_1 a.2_2 a.2_3 a.2_4 a.3_1
# 1: a b c d a b c d a b c d a
# 2: e f g h e f g h e f g h e
# 3: a b c d a b c d a b c d a
# 4: e f g h e f g h e f g h e
# 5: a b c d a b c d a b c d a
# ---
# 999996: e f g h e f g h e f g h e
# 999997: a b c d a b c d a b c d a
# 999998: e f g h e f g h e f g h e
# 999999: a b c d a b c d a b c d a
# 1000000: e f g h e f g h e f g h e
# a.3_2 a.3_3 a.3_4 a.4_1 a.4_2 a.4_3 a.4_4
# 1: b c d a b c d
# 2: f g h e f g h
# 3: b c d a b c d
# 4: f g h e f g h
# 5: b c d a b c d
# ---
# 999996: f g h e f g h
# 999997: b c d a b c d
# 999998: f g h e f g h
# 999999: b c d a b c d
# 1000000: f g h e f g h
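For readers who want to see what splitting several columns at once involves, here is a rough base R sketch that produces the same column naming pattern (a_1, a_2, ..., a.1_1, ...). The helper split_cols is hypothetical and is not how concat.split.DT is implemented. It assumes balanced data and a literal (non-regex) separator, and it has not been benchmarked against the functions above.

```r
# Hypothetical base R helper: split each named column on a literal separator
# and bind the pieces together, naming them <column>_<piece index>.
# Assumes every value in a column splits into the same number of pieces.
split_cols <- function(df, cols, sep) {
  pieces <- lapply(cols, function(col) {
    m <- do.call(rbind, strsplit(as.character(df[[col]]), sep, fixed = TRUE))
    colnames(m) <- paste(col, seq_len(ncol(m)), sep = "_")
    m
  })
  as.data.frame(do.call(cbind, pieces), stringsAsFactors = FALSE)
}

# Toy data: two columns holding the same concatenated strings
toy <- data.frame(a   = c("a/b/c/d", "e/f/g/h"),
                  a.1 = c("a/b/c/d", "e/f/g/h"),
                  stringsAsFactors = FALSE)
res <- split_cols(toy, names(toy), "/")
names(res)
```

Unlike the output shown above, this returns a plain data.frame of character columns rather than a data.table.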