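I'm looking for a fast way to remove all dominated rows from a table (preferably using parallel processing, to take advantage of multiple cores).
A "dominated row" is a row that is less than or equal to some other row in every column. For example, in the following table: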
tribble(~a, ~b, ~c,
10, 5, 3,
10, 4, 2,
1, 4, 1,
7, 3, 6)
Rows 2 and 3 are dominated (in this case, both by row 1) and should be removed. Rows 1 and 4 are not dominated by any other row and should be kept, giving the following table:
tribble(~a, ~b, ~c,
10, 5, 3,
7, 3, 6)
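To make the definition concrete, here is a minimal brute-force sketch in base R that checks it on the example above. The helper name `is_dominated` and the variable `example_rows` are hypothetical, introduced only for illustration, and the check treats an exact duplicate as not dominating its twin (an assumption that matches the filter code further down).

```r
# Example table from above as a base R data frame (illustrative only).
example_rows <- data.frame(a = c(10, 10, 1, 7),
                           b = c(5, 4, 4, 3),
                           c = c(3, 2, 1, 6))

# Hypothetical helper: TRUE when row i of matrix m is <= some other row in
# every column and differs from that row in at least one column.
is_dominated <- function(m, i) {
  ge_all  <- colSums(t(m) >= m[i, ]) == ncol(m)  # rows that are >= row i everywhere
  differs <- colSums(t(m) != m[i, ]) > 0         # rows that are not identical to row i
  any(ge_all & differs)
}

m <- as.matrix(example_rows)
dominated <- vapply(seq_len(nrow(m)), function(i) is_dominated(m, i), logical(1))

# Rows 2 and 3 are flagged; rows 1 and 4 remain.
example_rows[!dominated, ]
```

This is quadratic in the number of rows, so it only serves to pin down the definition, not as a fast solution.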
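To illustrate further, this is the kind of code I'd like to speed up:
library(dplyr)
library(tibble)

table1 = as_tibble(replicate(3, runif(500000)))
colnames(table1) = c("a", "b", "c")
table2 = table1
# For every row i, keep only the rows that beat row i in at least one column
# or are identical to it; everything else is dominated by row i.
for (i in 1:nrow(table1)) {
  table2 = filter(table2,
    (a > table1[i,]$a | b > table1[i,]$b | c > table1[i,]$c) |
    (a == table1[i,]$a & b == table1[i,]$b & c == table1[i,]$c) )
}
filtered_table = table2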
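For comparison, here is a minimal single-threaded sketch of one possible way to cut down the number of comparisons. It is not a known package function. The idea: if the rows are sorted in descending lexicographic order, any row that dominates another must come before it, so each row only needs to be compared against the non-dominated rows found so far. The function name `remove_dominated_sorted` is hypothetical, and whether this is actually faster depends on the assumption that the set of non-dominated rows stays small.

```r
# Sketch only: keep the rows of `tbl` that are not dominated by a different row.
# Exact duplicates of a kept row are kept as well, like the filter loop above.
remove_dominated_sorted <- function(tbl) {
  m <- as.matrix(tbl)
  n <- nrow(m)
  if (n == 0) return(tbl)

  # Descending lexicographic order: a dominating row always sorts earlier.
  cols <- lapply(seq_len(ncol(m)), function(j) m[, j])
  ord  <- do.call(order, c(cols, list(decreasing = TRUE)))

  keep <- logical(n)
  keep[ord[1]] <- TRUE                     # the first row in this order is never dominated
  front <- matrix(m[ord[1], ], ncol = 1)   # one column per distinct kept row

  for (i in ord[-1]) {
    x  <- m[i, ]
    ge <- colSums(front >= x) == length(x) # kept rows that are >= x in every column
    eq <- colSums(front == x) == length(x) # kept rows identical to x
    if (!any(ge & !eq)) {
      keep[i] <- TRUE
      if (!any(eq)) front <- cbind(front, x)
    }
  }
  tbl[keep, , drop = FALSE]                # rows come back in their original order
}

# Usage on the small example from the top of the question.
small <- data.frame(a = c(10, 10, 1, 7), b = c(5, 4, 4, 3), c = c(3, 2, 1, 6))
remove_dominated_sorted(small)
```

The loop is still sequential; the per-chunk pruning could in principle be combined with a parallel split, but that is untested here.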
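I have a few ideas, but I thought I'd ask whether there is a well-known package or function that already does this.
Update: here is a fairly naive parallelization of the code above; it nevertheless gives a solid performance boost: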
library(dplyr)
library(parallel)    # detectCores, makeCluster, splitIndices, stopCluster
library(foreach)
library(doParallel)  # registerDoParallel

remove_dominated = function(table) {
  ncores = detectCores()
  cl = makeCluster(ncores)
  registerDoParallel(cl)
  on.exit(stopCluster(cl))  # shut the workers down when the function returns
  # Divide the table into parts and remove dominated rows from each part
  tfref = foreach(part=splitIndices(nrow(table), ncores), .combine=rbind, .packages="dplyr") %dopar% {
    tpref = table[part,]
    tp = tpref
    for (i in seq_len(nrow(tpref))) {
      tp = filter(tp,
        (a > tpref[i,]$a | b > tpref[i,]$b | c > tpref[i,]$c) |
        (a == tpref[i,]$a & b == tpref[i,]$b & c == tpref[i,]$c) )
    }
    tp
  }
  # After the simplified parts have been concatenated, run a final pass to remove dominated rows from the full table
  result = tfref
  for (i in seq_len(nrow(tfref))) {
    result = filter(result,
      (a > tfref[i,]$a | b > tfref[i,]$b | c > tfref[i,]$c) |
      (a == tfref[i,]$a & b == tfref[i,]$b & c == tfref[i,]$c) )
  }
  return(result)
}