library(data.table)
set.seed(1350)
# Create an example data table:
dt <- data.table(ID=1:200,H=sample(1:1000,200),L=sample(1001:2000,200),key="ID")
# (If you already have a data frame 'df', you can use):
# dt <- as.data.table(df)
set.seed(5655)
# Add a column that randomly samples between H and L:
dt[,HL:=sample(c(H,L),1),by=ID]
dt
#       ID   H    L   HL
#   1:   1 837 1391 1391
#   2:   2 999 1573 1573
#   3:   3 566 1275  566
#   4:   4 347 1709 1709
#   5:   5 129 1627  129
#  ---
# 196: 196  67 1879 1879
# 197: 197 652 1811 1811
# 198: 198 569 1160 1160
# 199: 199  17 1026   17
# 200: 200 221 1500 1500
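Edit 2: As pointed out in the comments, my initial answer gives incorrect values if H contains duplicates. As was also suggested in the comments, I had added timings showing that data.table was faster, but once I corrected the answer it turned out to be much slower. (The incorrect answer was faster because it grouped by the duplicated values, so there were far fewer rows to consider...)
In short, I was wrong, and you are probably better off with one of the other answers.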
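To make the duplicate problem concrete, here is a minimal sketch on made-up data. The table `toy` and the column names `HL_byH` and `HL_byRow` are hypothetical and only serve the illustration. When rows share the same H, grouping by H makes a single draw from the pooled values and assigns it to every row in the group, so a row can end up with an L that belongs to a different row. Grouping by row number is one way to avoid that, assuming a per-row draw is what you want.

```r
library(data.table)

# Hypothetical toy data: H contains a duplicate (10 appears twice).
toy <- data.table(H = c(10, 10, 20), L = c(100, 200, 300))

# Grouping by H: rows 1 and 2 form one group, so a single draw from
# c(10, 100, 200) is assigned to both of them.
set.seed(1)
toy[, HL_byH := sample(c(H, L), 1), by = H]

# Grouping by row number keeps each row's own (H, L) pair together.
toy[, HL_byRow := sample(c(H, L), 1), by = seq_len(nrow(toy))]

print(toy)
```

In `HL_byH` the first two rows always hold the same value, whereas `HL_byRow` is always either that row's H or that row's L.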
Here is a proper benchmark:
library(microbenchmark)
set.seed(1350)
H <- sample(1:200, 200)
L <- sample(201:400, 200)
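# H is sampled without replacement above, so by = H puts each row in its own group.
usingDataTable <- quote({
  dt <- data.table(H, L)
  dt[, HL := sample(c(H, L), 1), by = H]
})
dt2 <- data.table(H, L)
usingDataTable.NoInitialize <- quote({
  dt2[, HL := sample(c(H, L), 1), by = H]
})
usingVectors <- quote({
  # rbinom(H, 1, 0.5) returns length(H) draws (a vector first argument is
  # read as the number of draws); 1 selects H, 0 selects L.
  ifelse(rbinom(H, 1, 0.5), H, L)
})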
microbenchmark(eval(usingVectors), eval(usingDataTable), eval(usingDataTable.NoInitialize), times=100L)
Unit: microseconds
                              expr      min       lq   median        uq      max neval
                eval(usingVectors)   55.021   61.148   66.760   69.4605 1682.163   100
              eval(usingDataTable) 1635.676 1745.437 1795.245 1851.0950 3629.179   100
 eval(usingDataTable.NoInitialize) 1458.573 1537.618 1596.237 1669.3750 3683.756   100
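
If the data already lives in a data.table, the vectorised idea from `usingVectors` can probably be applied there without any grouping. The following is only a sketch under that assumption, not something that was timed above; the table `dtv` is a hypothetical name introduced for the illustration.

```r
library(data.table)

set.seed(1350)
# Hypothetical table with the same shape as the benchmark data.
dtv <- data.table(H = sample(1:200, 200), L = sample(201:400, 200))

# One fair coin flip per row (.N is the row count here, since there is
# no grouping); a 1 keeps H and a 0 keeps L.
dtv[, HL := ifelse(rbinom(.N, 1, 0.5) == 1, H, L)]

head(dtv)
```

Because each row gets its own coin flip, duplicates in H make no difference here.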