TL;博士
使用comboGrid
来自RcppAlgos
:
library(RcppAlgos)
comboGrid(c("aa", "ab", "cc"), c("aa", "ab", "cc"))
Var1 Var2
[1,] "aa" "aa"
[2,] "aa" "ab"
[3,] "aa" "cc"
[4,] "ab" "ab"
[5,] "ab" "cc"
[6,] "cc" "cc"
细节
我最近遇到了这个问题R - Expand Grid without Duplicates并且在我搜索重复项时,我发现了这个问题。这个问题并不完全是重复的,因为它更笼统,并且有@Ferdinand.kraft 阐明的其他限制。
应该注意的是,这里的许多解决方案都使用了某种组合功能。该expand.grid
函数返回根本不同的笛卡尔积。
笛卡尔积对可能相同也可能不同的多个对象进行操作。一般来说,组合函数应用于单个向量。置换函数也是如此。
expand.grid
如果提供的向量相同,则使用组合/置换函数只会产生可比较的结果。作为一个非常简单的例子,考虑v1 = 1:3, v2 = 2:4
.
,expand.grid
我们看到第 3 行和第 5 行是重复的:
expand.grid(1:3, 2:4)
Var1 Var2
1 1 2
2 2 2
3 3 2
4 1 3
5 2 3
6 3 3
7 1 4
8 2 4
9 3 4
使用combn
并不能完全让我们找到解决方案:
t(combn(unique(c(1:3, 2:4)), 2))
[,1] [,2]
[1,] 1 2
[2,] 1 3
[3,] 1 4
[4,] 2 3
[5,] 2 4
[6,] 3 4
并且重复使用gtools
,我们生成了太多:
gtools::combinations(4, 2, v = unique(c(1:3, 2:4)), repeats.allowed = TRUE)
[,1] [,2]
[1,] 1 1
[2,] 1 2
[3,] 1 3
[4,] 1 4
[5,] 2 2
[6,] 2 3
[7,] 2 4
[8,] 3 3
[9,] 3 4
[10,] 4 4
事实上,我们生成的结果甚至不在笛卡尔积(即expand.grid
解)中。
我们需要一个创建以下内容的解决方案:
Var1 Var2
[1,] 1 2
[2,] 1 3
[3,] 1 4
[4,] 2 2
[5,] 2 3
[6,] 2 4
[7,] 3 3
[8,] 3 4
我编写了这个包RcppAlgos
,在最新版本v2.4.3
中,有一个函数comboGrid
可以解决这个问题。它非常通用、灵活且快速。
首先,回答OP提出的具体问题:
library(RcppAlgos)
comboGrid(c("aa", "ab", "cc"), c("aa", "ab", "cc"))
Var1 Var2
[1,] "aa" "aa"
[2,] "aa" "ab"
[3,] "aa" "cc"
[4,] "ab" "ab"
[5,] "ab" "cc"
[6,] "cc" "cc"
正如@Ferdinand.kraft 指出的那样,有时输出可能需要在给定行中排除重复项。为此,我们使用repetition = FALSE
:
comboGrid(c("aa", "ab", "cc"), c("aa", "ab", "cc"), repetition = FALSE)
Var1 Var2
[1,] "aa" "ab"
[2,] "aa" "cc"
[3,] "ab" "cc"
comboGrid
也很一般。它可以应用于多个向量:
comboGrid(rep(list(c("aa", "ab", "cc")), 3))
Var1 Var2 Var3
[1,] "aa" "aa" "aa"
[2,] "aa" "aa" "ab"
[3,] "aa" "aa" "cc"
[4,] "aa" "ab" "ab"
[5,] "aa" "ab" "cc"
[6,] "aa" "cc" "cc"
[7,] "ab" "ab" "ab"
[8,] "ab" "ab" "cc"
[9,] "ab" "cc" "cc"
[10,] "cc" "cc" "cc"
不需要向量相同:
comboGrid(1:3, 2:4)
Var1 Var2
[1,] 1 2
[2,] 1 3
[3,] 1 4
[4,] 2 2
[5,] 2 3
[6,] 2 4
[7,] 3 3
[8,] 3 4
并且可以应用于各种类型的向量:
set.seed(123)
my_range <- 3:15
mixed_types <- list(
int1 = sample(15, sample(my_range, 1)),
int2 = sample(15, sample(my_range, 1)),
char1 = sample(LETTERS, sample(my_range, 1)),
char2 = sample(LETTERS, sample(my_range, 1))
)
dim(expand.grid(mixed_types))
[1] 1950 4
dim(comboGrid(mixed_types, repetition = FALSE))
[1] 1595 4
dim(comboGrid(mixed_types, repetition = TRUE))
[1] 1770 4
所采用的算法避免了生成整个笛卡尔积并随后消除了欺骗。最终,我们使用算术基本定理和重复数据删除创建了一个哈希表,正如user2357112所指出的那样,支持 Monica 在从具有重叠的池中挑选无序组合的答案中。所有这些以及它是用它编写的事实C++
意味着它快速且内存高效:
pools = list(c(1, 10, 14, 6),
c(7, 2, 4, 8, 3, 11, 12),
c(11, 3, 13, 4, 15, 8, 6, 5),
c(10, 1, 3, 2, 9, 5, 7),
c(1, 5, 10, 3, 8, 14),
c(15, 3, 7, 10, 4, 5, 8, 6),
c(14, 9, 11, 15),
c(7, 6, 13, 14, 10, 11, 9, 4),
c(6, 3, 2, 14, 7, 12, 9),
c(6, 11, 2, 5, 15, 7))
system.time(combCarts <- comboGrid(pools))
user system elapsed
0.929 0.062 0.992
nrow(combCarts)
[1] 1205740
## Small object created
print(object.size(combCarts), unit = "Mb")
92 Mb
system.time(cartProd <- expand.grid(pools))
user system elapsed
8.477 2.895 11.461
prod(lengths(pools))
[1] 101154816
## Very large object created
print(object.size(cartProd), unit = "Mb")
7717.5 Mb