49

The R function expand.grid returns all possible combinations between the elements of the supplied arguments. For example:

> expand.grid(c("aa", "ab", "cc"), c("aa", "ab", "cc"))
  Var1 Var2
1   aa   aa
2   ab   aa
3   cc   aa
4   aa   ab
5   ab   ab
6   cc   ab
7   aa   cc
8   ab   cc
9   cc   cc

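Do you know an efficient way to get directly (that is, without any row comparison after expand.grid) only the "unique" combinations between the supplied vectors? The output would be: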

  Var1 Var2
1   aa   aa
2   ab   aa
3   cc   aa
5   ab   ab
6   cc   ab
9   cc   cc

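EDIT: The combination of each element with itself could eventually be discarded from the answer. I don't actually need it in my program, even though (mathematically) aa aa would be one (regular) unique combination between one element of Var1 and another of Var2.

The solution needs to produce pairs of elements from both vectors (i.e. one from each of the input vectors - so that it could be applied to more than 2 inputs).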

9 Answers

33

How about using outer? Note, however, that this particular function call concatenates each pair into a single character string.

outer( c("aa", "ab", "cc"), c("aa", "ab", "cc") , "paste" )
#     [,1]    [,2]    [,3]   
#[1,] "aa aa" "aa ab" "aa cc"
#[2,] "ab aa" "ab ab" "ab cc"
#[3,] "cc aa" "cc ab" "cc cc"

You can also use combn on the unique elements of the two vectors if you don't want the repeating elements (e.g. aa aa):

vals <- c( c("aa", "ab", "cc"), c("aa", "ab", "cc") )
vals <- unique( vals )
combn( vals , 2 )
#     [,1] [,2] [,3]
#[1,] "aa" "aa" "ab"
#[2,] "ab" "cc" "cc"
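
Note that combn returns one pair per column, not per row. A minimal sketch for getting a layout closer to expand.grid, assuming base R only (the name pairs and the column names Var1/Var2 are chosen here for illustration):

```r
# Collect the unique elements of both vectors
vals <- unique(c(c("aa", "ab", "cc"), c("aa", "ab", "cc")))

# combn() puts each pair in a column; transpose so each pair is a row
pairs <- as.data.frame(t(combn(vals, 2)), stringsAsFactors = FALSE)
names(pairs) <- c("Var1", "Var2")
pairs
```

This only makes sense when both inputs draw from the same pool of values, since combn works on a single vector.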
answered 2013-06-18T14:15:21.667
20

In base R, you can use this:

expand.grid.unique <- function(x, y, include.equals=FALSE)
{
    x <- unique(x)

    y <- unique(y)

    g <- function(i)
    {
        z <- setdiff(y, x[seq_len(i-include.equals)])

        if(length(z)) cbind(x[i], z, deparse.level=0)
    }

    do.call(rbind, lapply(seq_along(x), g))
}

Results:

> x <- c("aa", "ab", "cc")
> y <- c("aa", "ab", "cc")

> expand.grid.unique(x, y)
     [,1] [,2]
[1,] "aa" "ab"
[2,] "aa" "cc"
[3,] "ab" "cc"

> expand.grid.unique(x, y, include.equals=TRUE)
     [,1] [,2]
[1,] "aa" "aa"
[2,] "aa" "ab"
[3,] "aa" "cc"
[4,] "ab" "ab"
[5,] "ab" "cc"
[6,] "cc" "cc"
answered 2013-06-18T14:38:32.930
17

If the two vectors are the same, there is the combinations function in the gtools package:

library(gtools)
combinations(n = 3, r = 2, v = c("aa", "ab", "cc"), repeats.allowed = TRUE)

#      [,1] [,2]
# [1,] "aa" "aa"
# [2,] "aa" "ab"
# [3,] "aa" "cc"
# [4,] "ab" "ab"
# [5,] "ab" "cc"
# [6,] "cc" "cc"

And without "aa" "aa" etc.:

combinations(n = 3, r = 2, v = c("aa", "ab", "cc"), repeats.allowed = FALSE)
answered 2013-06-18T14:26:44.913
13

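The previous answers lack a way to get one specific result, namely keeping the self-pairs but removing the pairs that differ only in order. The gtools package has two functions for these purposes, combinations and permutations. According to this website:

  • When the order doesn't matter, it is a combination.
  • When the order does matter, it is a permutation.

In both cases we get to decide whether repetitions are allowed, and correspondingly both functions have a repeats.allowed argument, which yields 4 combinations (deliciously meta!). It's worth going over each of them. I simplified the vector to single letters for ease of understanding.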

Permutations with repetition

The most expansive option is to allow both self-relations and differently ordered options:

> permutations(n = 3, r = 2, repeats.allowed = T, v = c("a", "b", "c"))
      [,1] [,2]
 [1,] "a"  "a" 
 [2,] "a"  "b" 
 [3,] "a"  "c" 
 [4,] "b"  "a" 
 [5,] "b"  "b" 
 [6,] "b"  "c" 
 [7,] "c"  "a" 
 [8,] "c"  "b" 
 [9,] "c"  "c" 

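This gives us 9 options. This value can be found from the simple formula n^r, i.e. 3^2 = 9. For users familiar with SQL, this is the Cartesian product/join.

There are two ways to limit this: 1) remove self-relations (disallow repetitions), or 2) remove differently ordered options (i.e. use combinations).

Combinations with repetition

If we want to remove differently ordered options, we use: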

> combinations(n = 3, r = 2, repeats.allowed = T, v = c("a", "b", "c"))
     [,1] [,2]
[1,] "a"  "a" 
[2,] "a"  "b" 
[3,] "a"  "c" 
[4,] "b"  "b" 
[5,] "b"  "c" 
[6,] "c"  "c" 

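This gives us 6 options. The formula for this value is (r+n-1)!/(r!*(n-1)!), i.e. (2+3-1)!/(2!*(3-1)!) = 4!/(2*2!) = 24/4 = 6.

Permutations without repetition

If instead we want to disallow repetitions, we use: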

> permutations(n = 3, r = 2, repeats.allowed = F, v = c("a", "b", "c"))
     [,1] [,2]
[1,] "a"  "b" 
[2,] "a"  "c" 
[3,] "b"  "a" 
[4,] "b"  "c" 
[5,] "c"  "a" 
[6,] "c"  "b" 

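This also gives us 6 options, but different ones! The number of options is the same as above, but that is a coincidence. The value can be found from the formula n!/(n-r)!, i.e. (3*2*1)/(3-2)! = 6/1! = 6.

Combinations without repetition

The most limiting case is when we want neither self-relations/repetitions nor differently ordered options, in which case we use: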

> combinations(n = 3, r = 2, repeats.allowed = F, v = c("a", "b", "c"))
     [,1] [,2]
[1,] "a"  "b" 
[2,] "a"  "c" 
[3,] "b"  "c" 

This gives us only 3 options. The number of options can be calculated from the rather complex formula n!/(r!(n-r)!), i.e. 3*2*1/(2*1*(3-2)!) = 6/(2*1!) = 6/2 = 3.
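
The four counting formulas above can be checked quickly in base R. This is just a sketch of the arithmetic; the variable names are chosen here for illustration and are not part of gtools:

```r
n <- 3  # number of distinct elements
r <- 2  # number of elements per selection

perm_rep   <- n^r                              # permutations with repetition
comb_rep   <- choose(n + r - 1, r)             # combinations with repetition
perm_norep <- factorial(n) / factorial(n - r)  # permutations without repetition
comb_norep <- choose(n, r)                     # combinations without repetition

c(perm_rep, comb_rep, perm_norep, comb_norep)
# [1] 9 6 6 3
```

Here choose(n + r - 1, r) is the same quantity as (r+n-1)!/(r!*(n-1)!), and choose(n, r) is n!/(r!(n-r)!).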

answered 2016-11-10T12:01:41.613
9

Try:

factors <- c("a", "b", "c")

all.combos <- t(combn(factors,2))

     [,1] [,2]
[1,] "a"  "b" 
[2,] "a"  "c" 
[3,] "b"  "c"

This will not include the self-pairs of each factor (e.g. "a" "a"), but you can easily add them if needed.

dup.combos <- cbind(factors,factors)

     factors factors
[1,] "a"     "a"    
[2,] "b"     "b"    
[3,] "c"     "c"   

all.combos <- rbind(all.combos,dup.combos)

     factors factors
[1,] "a"     "b"    
[2,] "a"     "c"    
[3,] "b"     "c"    
[4,] "a"     "a"    
[5,] "b"     "b"    
[6,] "c"     "c" 
answered 2014-02-28T15:59:59.820
3

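You can use a "greater than or equal" comparison to filter out redundant combinations. This works with both numeric and character vectors.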

> grid <- expand.grid(c("aa", "ab", "cc"), c("aa", "ab", "cc"), stringsAsFactors = F)
> grid[grid$Var1 >= grid$Var2, ]
  Var1 Var2
1   aa   aa
2   ab   aa
3   cc   aa
5   ab   ab
6   cc   ab
9   cc   cc

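This shouldn't slow down your code too much. If you're expanding vectors containing larger elements (e.g. two lists of data frames), I recommend using numeric indices that refer to the original vectors.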
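
A minimal sketch of the numeric-index idea, assuming both inputs are the same list (the names items, idx and pairs, and the toy data frames, are made up for illustration):

```r
# A list of larger objects that we do not want to copy around in expand.grid
items <- list(a = data.frame(x = 1:2),
              b = data.frame(x = 3:4),
              c = data.frame(x = 5:6))

# Expand only the integer positions, then drop the redundant half
idx <- expand.grid(i = seq_along(items), j = seq_along(items))
idx <- idx[idx$i >= idx$j, ]

# Look up the actual objects only for the pairs that survived the filter
pairs <- Map(function(i, j) list(first = items[[i]], second = items[[j]]),
             idx$i, idx$j)
length(pairs)
```

Comparing integer positions also sidesteps the question of how to order objects that have no natural ">=" of their own.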

answered 2020-02-01T16:08:32.790
2

TL;DR

Use comboGrid from RcppAlgos:

library(RcppAlgos)
comboGrid(c("aa", "ab", "cc"), c("aa", "ab", "cc"))
     Var1 Var2
[1,] "aa" "aa"
[2,] "aa" "ab"
[3,] "aa" "cc"
[4,] "ab" "ab"
[5,] "ab" "cc"
[6,] "cc" "cc"

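Details

I recently came across the question "R - Expand Grid without Duplicates", and while searching for a duplicate I found this question. That question isn't exactly a duplicate, as it is a bit more general and has additional restrictions, which @Ferdinand.kraft shed some light on.

It should be noted that many of the solutions here make use of some sort of combination function. The expand.grid function returns the Cartesian product, which is fundamentally different.

The Cartesian product operates on multiple objects, which may or may not be the same. Generally speaking, combination functions are applied to a single vector. The same can be said about permutation functions.

Using combination/permutation functions will only produce results comparable to expand.grid if the supplied vectors are identical. As a very simple example, consider v1 = 1:3, v2 = 2:4.

With expand.grid, we see that rows 3 and 5 are duplicates: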

expand.grid(1:3, 2:4)
  Var1 Var2
1    1    2
2    2    2
3    3    2
4    1    3
5    2    3
6    3    3
7    1    4
8    2    4
9    3    4

Using combn doesn't quite get us to the solution:

t(combn(unique(c(1:3, 2:4)), 2))
     [,1] [,2]
[1,]    1    2
[2,]    1    3
[3,]    1    4
[4,]    2    3
[5,]    2    4
[6,]    3    4

And with repeats using gtools, we generate too many:

gtools::combinations(4, 2, v = unique(c(1:3, 2:4)), repeats.allowed = TRUE)
      [,1] [,2]
 [1,]    1    1
 [2,]    1    2
 [3,]    1    3
 [4,]    1    4
 [5,]    2    2
 [6,]    2    3
 [7,]    2    4
 [8,]    3    3
 [9,]    3    4
[10,]    4    4

In fact, we generate results that are not even in the Cartesian product (i.e. the expand.grid solution).

We need a solution that creates the following:

     Var1 Var2
[1,]    1    2
[2,]    1    3
[3,]    1    4
[4,]    2    2
[5,]    2    3
[6,]    2    4
[7,]    3    3
[8,]    3    4

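I authored the package RcppAlgos, and in the latest release, v2.4.3, there is a function comboGrid that addresses this very problem. It is very general, flexible, and fast.

First, to answer the specific question raised by the OP: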

library(RcppAlgos)
comboGrid(c("aa", "ab", "cc"), c("aa", "ab", "cc"))
     Var1 Var2
[1,] "aa" "aa"
[2,] "aa" "ab"
[3,] "aa" "cc"
[4,] "ab" "ab"
[5,] "ab" "cc"
[6,] "cc" "cc"

As @Ferdinand.kraft points out, sometimes the output may need to exclude duplicates within a given row. For that, we use repetition = FALSE:

comboGrid(c("aa", "ab", "cc"), c("aa", "ab", "cc"), repetition = FALSE)
     Var1 Var2
[1,] "aa" "ab"
[2,] "aa" "cc"
[3,] "ab" "cc"

comboGrid is also very general. It can be applied to multiple vectors:

comboGrid(rep(list(c("aa", "ab", "cc")), 3))
      Var1 Var2 Var3
 [1,] "aa" "aa" "aa"
 [2,] "aa" "aa" "ab"
 [3,] "aa" "aa" "cc"
 [4,] "aa" "ab" "ab"
 [5,] "aa" "ab" "cc"
 [6,] "aa" "cc" "cc"
 [7,] "ab" "ab" "ab"
 [8,] "ab" "ab" "cc"
 [9,] "ab" "cc" "cc"
[10,] "cc" "cc" "cc"

It doesn't need the vectors to be identical:

comboGrid(1:3, 2:4)
     Var1 Var2
[1,]    1    2
[2,]    1    3
[3,]    1    4
[4,]    2    2
[5,]    2    3
[6,]    2    4
[7,]    3    3
[8,]    3    4

And it can be applied to vectors of various types:

set.seed(123)
my_range <- 3:15
mixed_types <- list(
    int1 = sample(15, sample(my_range, 1)),
    int2 = sample(15, sample(my_range, 1)),
    char1 = sample(LETTERS, sample(my_range, 1)),
    char2 = sample(LETTERS, sample(my_range, 1))
)

dim(expand.grid(mixed_types))
[1] 1950    4

dim(comboGrid(mixed_types, repetition = FALSE))
[1] 1595    4

dim(comboGrid(mixed_types, repetition = TRUE))
[1] 1770    4

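The algorithm implemented avoids generating the entire Cartesian product and subsequently removing duplicates. Ultimately, we create a hash table using the Fundamental Theorem of Arithmetic along with deduplication, as pointed out by user "user2357112 supports Monica" in the answer to "Picking unordered combinations from pools with overlap". All of this, together with the fact that it is written in C++, means that it is fast and memory efficient: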

pools = list(c(1, 10, 14, 6),
             c(7, 2, 4, 8, 3, 11, 12),
             c(11, 3, 13, 4, 15, 8, 6, 5),
             c(10, 1, 3, 2, 9, 5,  7),
             c(1, 5, 10, 3, 8, 14),
             c(15, 3, 7, 10, 4, 5, 8, 6),
             c(14, 9, 11, 15),
             c(7, 6, 13, 14, 10, 11, 9, 4),
             c(6,  3,  2, 14,  7, 12,  9),
             c(6, 11,  2,  5, 15,  7))
             
system.time(combCarts <- comboGrid(pools))
   user  system elapsed 
  0.929   0.062   0.992

nrow(combCarts)
[1] 1205740

## Small object created
print(object.size(combCarts), unit = "Mb")
92 Mb
  
system.time(cartProd <- expand.grid(pools))
   user  system elapsed 
  8.477   2.895  11.461 
  
prod(lengths(pools))
[1] 101154816

## Very large object created
print(object.size(cartProd), unit = "Mb")
7717.5 Mb
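
The prime-product keying idea mentioned above can be illustrated in plain R. This is only a toy sketch under stated assumptions: it still builds the full expand.grid (which, per the answer, the package avoids), and the hand-picked primes and variable names are illustrative, not RcppAlgos internals.

```r
v1 <- 1:3
v2 <- 2:4

# Assign one distinct prime to each distinct value (assumption: 4 values, 4 primes)
vals   <- sort(unique(c(v1, v2)))
primes <- c(2, 3, 5, 7)
names(primes) <- vals

grid <- expand.grid(Var1 = v1, Var2 = v2)

# By unique factorization, the product of primes identifies the unordered pair:
# (2, 3) and (3, 2) both map to 3 * 5 = 15
key <- unname(primes[as.character(grid$Var1)] * primes[as.character(grid$Var2)])

dedup <- grid[!duplicated(key), ]
nrow(dedup)
```

Because multiplication is commutative, rows that differ only in order share a key, so a single duplicated() pass over the keys removes them.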
answered 2021-06-19T21:28:48.543
0

Here's a very ugly version that worked for me on a similar problem.

AHP_code <- letters[1:10]
temp. <- expand.grid(AHP_code, AHP_code, stringsAsFactors = FALSE)
temp. <- temp.[temp.$Var1 != temp.$Var2, ] # remove AA, BB, CC, etc.
temp.$combo <- NA
for (i in 1:nrow(temp.)) { # vectorizing this gave me weird results, loop worked fine.
  temp.$combo[i] <- paste0(sort(as.character(temp.[i, 1:2])), collapse = "")
}
temp. <- temp.[!duplicated(temp.$combo), ]
temp.

answered 2020-04-06T21:10:52.800
0

Using sort

Just for fun, one can in principle also remove the duplicates from expand.grid by combining sort and unique:

unique(t(apply(expand.grid(c("aa", "ab", "cc"), c("aa", "ab", "cc")), 1, sort)))

This gives:

    [,1] [,2]
[1,] "aa" "aa"
[2,] "aa" "ab"
[3,] "aa" "cc"
[4,] "ab" "ab"
[5,] "ab" "cc"
[6,] "cc" "cc"
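
For exactly two columns, the row-wise sort can also be written without apply. A possible vectorized variant, assuming base R's pmin and pmax (which accept character vectors) and using the illustrative names g and res:

```r
g <- expand.grid(c("aa", "ab", "cc"), c("aa", "ab", "cc"),
                 stringsAsFactors = FALSE)

# Put the smaller element of each row first, the larger second, then deduplicate
res <- unique(data.frame(Var1 = pmin(g$Var1, g$Var2),
                         Var2 = pmax(g$Var1, g$Var2),
                         stringsAsFactors = FALSE))
nrow(res)
```

This keeps the result as a data frame and avoids the character-matrix coercion that apply performs, but it does not generalize beyond two columns as directly as the sort version does.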
answered 2022-03-03T12:11:11.430