5

Firstly I'm pretty sure this has been answered before but the search terms seem difficult to hit, apologies if there is a duplicate out there.

Say I have a vector of factors:

all <- factor(letters)

And I've gone on to use all combinations of those factor levels as part of a modelling pipeline:

combos <- t(combn(as.character(all), 5))
head(combos)
#     [,1] [,2] [,3] [,4] [,5]
# [1,] "a"  "b"  "c"  "d"  "e" 
# [2,] "a"  "b"  "c"  "d"  "f" 
# [3,] "a"  "b"  "c"  "d"  "g" 
# ...

My question is: How can I convert this second matrix to one showing presence/absence of all levels, like:

      a   b   c   d   e   f   g  ...
[1,]  1   1   1   1   1   0   0  ...
[2,]  1   1   1   1   0   1   0  ...
[3,]  1   1   1   1   0   0   1  ...
...

In terms of what I've tried, my first thought was a row-wise application of ifelse using apply, but I haven't been able to put anything workable together. Any smart way of doing this?

4

3 回答 3

3

Here's my attempt:

combos.out <- t(apply(combos, 1, function(x) table(factor(x, levels = letters))))
head(combos.out)
#      a b c d e f g h i j k l m n o p q r s t u v w x y z
# [1,] 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [2,] 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [3,] 1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [4,] 1 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [5,] 1 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [6,] 1 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

WRT @Ananda Mahto's comment, the manipulation through transformation and factorisation definitely slows things down - a quick and dirty benchmark:

#Unit: milliseconds
#             expr        min         lq     median         uq        max neval
#   forfun(combos)   416.6027   534.6973   652.7919   718.4231   784.0544     3
# applyfun(combos) 13892.7020 15755.8570 17619.0121 22559.8271 27500.6421     3

Score one for the for loop!

于 2013-09-30T12:45:48.523 回答
3

Update: An even better solution

You can use matrix indexing to get even better speeds. Here is a much improved solution that does not use a for loop.

all <- factor(letters)
combos <- t(combn(as.character(all), 5))
A <- match(c(t(combos)), letters)
B <- 0:(length(A)-1) %/% 5 + 1
a <- unique(as.vector(combos))
x <- matrix(0, ncol = length(a), nrow = nrow(combos), 
            dimnames = list(NULL, a))
x[cbind(B, A)] <- 1L

Benchmarks

orig <- function() {
  a <- unique(as.vector(combos))
  x <- matrix(0, ncol = length(a), nrow = nrow(combos), 
              dimnames = list(NULL, a))
  for (i in 1:nrow(combos)) {
    x[i, combos[i, ]] <- 1
  }
  x
}

new <- function() {
  A <- match(c(t(combos)), letters)
  B <- 0:(length(A)-1) %/% 5 + 1
  a <- unique(as.vector(combos))
  x <- matrix(0, ncol = length(a), nrow = nrow(combos), 
              dimnames = list(NULL, a))
  x[cbind(B, A)] <- 1L
  x
}

identical(orig(), new())
# [1] TRUE

library(microbenchmark)
microbenchmark(orig(), new(), times = 20)
# Unit: milliseconds
#    expr       min        lq    median       uq      max neval
#  orig() 476.85206 486.11091 497.48429 512.4333 579.2695    20
#   new()  87.02026  91.17021  96.88463 111.6414 175.6339    20

Original answer

In a problem like this, a for loop would work just fine and can be easily preallocated:

a <- unique(as.vector(combos))
x <- matrix(0, ncol = length(a), nrow = nrow(combos), 
            dimnames = list(NULL, a))

for (i in 1:nrow(combos)) {
  x[i, combos[i, ]] <- 1
}

head(x)
#      a b c d e f g h i j k l m n o p q r s t u v w x y z
# [1,] 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [2,] 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [3,] 1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [4,] 1 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [5,] 1 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [6,] 1 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
于 2013-09-30T12:51:42.427 回答
2

A simple, and pretty efficient solution:

t(apply(combos,1,function(x){all %in% x}))*1

The for loop solution by Ananda Mahto is still about twice as fast:

      min       lq  median       uq      max neval
 561.2153 638.4648 643.439 650.7053 1199.857   100

versus

      min       lq   median       uq      max neval
 295.8798 305.0586 311.9961 370.6028 406.9336   100
于 2013-09-30T13:34:32.547 回答