I would like to create dummy variables from this dataset:

DF<-structure(list(A = c(1, 2, 3, 4, 5), B = c("1,3,2", "2,1,3,6", 
  "3,2,5,1,7", "3,7,4,2,6,5", "4,10,7,3,5,6")), .Names = c("A", "B"), 
              row.names = c(NA, 5L), class = "data.frame")
> DF
  A                  B
1 1              1,3,2
2 2            2,1,3,6
3 3          3,2,5,1,7
4 4        3,7,4,2,6,5
5 5       4,10,7,3,5,6

The desired output should look like this:

A  1  2  3  4  5  6  7  8  9  10
1  1  1  1  0  0  0  0  0  0  0
2  1  1  1  0  0  1  0  0  0  0
3  1  1  1  0  1  0  1  0  0  0
4  0  1  1  1  1  1  1  0  0  0
5  0  0  1  1  1  1  1  0  0  1

Is there an efficient way to do this? I can use strsplit or ifelse. The original dataset is very large, with many rows (>10k) and many distinct values in column B (>15k). The function dummy from the package dummies doesn't do what I want.

I also found a similar case: Splitting one column into multiple columns. But the answers from that question run really slowly in my case (up to 15 minutes on my Dell i7-2630QM, 8 GB, Win7 64-bit, R 2.15.3 64-bit).

Thank you in advance for your answers.

7 Answers

Update

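The function mentioned here has now been moved to a package available on CRAN called "splitstackshape". The version on CRAN is considerably faster than this original version. Its speed should be similar to what you would get with the direct for loop solution at the end of this answer. See @Ricardo's answer for detailed benchmarks.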

Install it, and use concat.split.expanded to get the desired result:

library(splitstackshape)
concat.split.expanded(DF, "B", fill = 0, drop = TRUE)
#   A B_01 B_02 B_03 B_04 B_05 B_06 B_07 B_08 B_09 B_10
# 1 1    1    1    1    0    0    0    0    0    0    0
# 2 2    1    1    1    0    0    1    0    0    0    0
# 3 3    1    1    1    0    1    0    1    0    0    0
# 4 4    0    1    1    1    1    1    1    0    0    0
# 5 5    0    0    1    1    1    1    1    0    0    1

Original post

Some time ago, I wrote a function that does not only this kind of splitting but other kinds as well. The function, named concat.split(), can be found here.

For your example data, the usage would be:

## Keeping the original column
concat.split(DF, "B", structure="expanded")
#   A            B B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_10
# 1 1        1,3,2   1   1   1  NA  NA  NA  NA  NA  NA   NA
# 2 2      2,1,3,6   1   1   1  NA  NA   1  NA  NA  NA   NA
# 3 3    3,2,5,1,7   1   1   1  NA   1  NA   1  NA  NA   NA
# 4 4  3,7,4,2,6,5  NA   1   1   1   1   1   1  NA  NA   NA
# 5 5 4,10,7,3,5,6  NA  NA   1   1   1   1   1  NA  NA    1

## Dropping the original column
concat.split(DF, "B", structure="expanded", drop.col=TRUE)
#   A B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_10
# 1 1   1   1   1  NA  NA  NA  NA  NA  NA   NA
# 2 2   1   1   1  NA  NA   1  NA  NA  NA   NA
# 3 3   1   1   1  NA   1  NA   1  NA  NA   NA
# 4 4  NA   1   1   1   1   1   1  NA  NA   NA
# 5 5  NA  NA   1   1   1   1   1  NA  NA    1

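Recoding NA to 0 has to be done manually. Perhaps I'll update the function to add an option to do so, and at the same time implement one of these faster solutions :)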

temp <- concat.split(DF, "B", structure="expanded", drop.col=TRUE)
temp[is.na(temp)] <- 0
temp
#   A B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_10
# 1 1   1   1   1   0   0   0   0   0   0    0
# 2 2   1   1   1   0   0   1   0   0   0    0
# 3 3   1   1   1   0   1   0   1   0   0    0
# 4 4   0   1   1   1   1   1   1   0   0    0
# 5 5   0   0   1   1   1   1   1   0   0    1

Update

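Most of the overhead in the concat.split function probably comes from things like converting from a matrix to a data.frame, renaming the columns, and so on. The actual code used to do the splitting is a (gasp!) for loop, but test it out and you'll find that it performs pretty well: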

b = strsplit(DF$B, ",")
ncol = max(as.numeric(unlist(b)))
temp = lapply(b, as.numeric)
## Set up an empty matrix
m = matrix(0, nrow = nrow(DF), ncol = ncol)      
## Fill it in
for (i in 1:nrow(DF)) {
  m[i, temp[[i]]] = 1
}
## View your result
m 
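
The loop above returns a bare matrix, while the desired output in the question also carries column A and the values 1 to 10 as column names. A minimal sketch of one way to attach them, assuming the example DF from the question (the name out is only for illustration):

```r
DF <- data.frame(A = c(1, 2, 3, 4, 5),
                 B = c("1,3,2", "2,1,3,6", "3,2,5,1,7",
                       "3,7,4,2,6,5", "4,10,7,3,5,6"),
                 stringsAsFactors = FALSE)

# Same loop as above, written with integers
b <- lapply(strsplit(DF$B, ",", fixed = TRUE), as.integer)
m <- matrix(0L, nrow = nrow(DF), ncol = max(unlist(b)))
for (i in seq_len(nrow(DF))) {
  m[i, b[[i]]] <- 1L
}

# Name the columns after the values they represent, then put A in front
colnames(m) <- seq_len(ncol(m))
out <- cbind(DF["A"], m)
out
```

cbind() on a data.frame and a matrix returns a data.frame, so for very wide results it may be cheaper to keep m as a matrix and store A separately.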
Answered 2013-04-29T04:55:48.287

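Updates:

Added benchmarks below.
Update 2: added benchmarks for @Ananda's solution. Wow, it's fast! Also added benchmarks for an even larger data set, where @Ananda's solution pulls ahead by an even larger margin.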


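Original answer: As you can see below, KnownMax and UnknownMax outperform even the data.table solution. That said, I suspect that with 10e6+ rows the data.table solution would be the fastest. (Feel free to benchmark it by simply modifying the parameters at the very bottom of this post.)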


Solution 1: KnownMax

If you know the maximum value in B, then you have a nice two-liner:

maximum <- 10
results <- t(sapply(strsplit(DF$B, ","), `%in%`, x=1:maximum)) + 0

#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,]    1    1    1    0    0    0    0    0    0     0
# [2,]    1    1    1    0    0    1    0    0    0     0
# [3,]    1    1    1    0    1    0    1    0    0     0
# [4,]    0    1    1    1    1    1    1    0    0     0
# [5,]    0    0    1    1    1    1    1    0    0     1

Three lines, if you want to name the columns and rows:

dimnames(results) <- list(seq(nrow(results)), seq(ncol(results)))
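
To see why the two-liner works, it may help to evaluate by hand what sapply does for a single element. The sketch below uses one row of B from the question; the names row_vals and hits are only for illustration.

```r
# One row of B from the question, already split
row_vals <- strsplit("2,1,3,6", ",")[[1]]   # c("2", "1", "3", "6")
maximum <- 10

# What sapply(..., `%in%`, x = 1:maximum) evaluates for this element:
# x is matched by name, so row_vals fills the remaining argument, `table`
hits <- `%in%`(row_vals, x = 1:maximum)

# match() brings both sides to a common type (character here),
# so the integers 1:10 can be compared with "2", "1", ...
hits + 0   # adding 0 turns TRUE/FALSE into 1/0
```

Because the comparison is done on character values, entries with stray whitespace such as " 3" would not match; the example data has none.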

Solution 2: UnknownMax

# if you do not know the maximum ahead of time: 
splat <- strsplit(DF$B, ",")
maximum <- max(as.numeric(unlist(splat)))
t(sapply(splat, `%in%`, x=1:maximum)) + 0

Solution 3: DT

As per @dickoa's request, here is an option with data.table.

DT <- data.table(DF)

DT.long <- DT[,  list(vals=as.numeric(unlist(strsplit(B, ",")))), by=A]

cols <- DT.long[, max(vals)]
rows <- DT.long[, max(A)] 

matrix(as.numeric(DT.long[, (1:cols) %in% vals, by=A]$V1), ncol=cols,
       byrow=TRUE, dimnames=list(seq(rows), seq(cols)))

#   1 2 3 4 5 6 7 8 9 10
# 1 1 1 1 0 0 0 0 0 0  0
# 2 1 1 1 0 0 1 0 0 0  0
# 3 1 1 1 0 1 0 1 0 0  0
# 4 0 1 1 1 1 1 1 0 0  0
# 5 0 0 1 1 1 1 1 0 0  1

A similar setup can be done in base R as well.
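
The answer does not show the base R version, so the following is only a sketch of what such a setup might look like: build the same long table of (A, value) pairs and cross-tabulate it. It assumes the example DF from the question and R >= 3.2.0 for lengths(); the names long, tab and m are illustrative.

```r
DF <- data.frame(A = c(1, 2, 3, 4, 5),
                 B = c("1,3,2", "2,1,3,6", "3,2,5,1,7",
                       "3,7,4,2,6,5", "4,10,7,3,5,6"),
                 stringsAsFactors = FALSE)

splat <- strsplit(DF$B, ",", fixed = TRUE)

# Long format: one row per (A, value) pair, like DT.long above
long <- data.frame(A = rep(DF$A, lengths(splat)),
                   vals = as.integer(unlist(splat)))

# Cross-tabulate; the factor levels force a column for every value
# from 1 to the maximum, including values that never occur (8 and 9)
cols <- max(long$vals)
tab <- table(long$A, factor(long$vals, levels = seq_len(cols)))

# Counts above 1 (a value repeated within a row) collapse to 1
m <- matrix(as.integer(tab > 0), nrow = nrow(tab),
            dimnames = list(rownames(tab), colnames(tab)))
m
```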

===


Here are some benchmarks with slightly larger data:

microbenchmark(KnownMax = eval(KnownMax), UnknownMax = eval(UnknownMax),
    DT.withAssign = eval(DT.withAssign),
    DT.withOutAssign = eval(DT.withOutAssign),
    lapply.Dickoa = eval(lapply.Dickoa), apply.SimonO101 = eval(apply.SimonO101),
    forLoop.Ananda = eval(forLoop.Ananda), times=50L)

Using the OP's data.frame, where the result is 5 x 10:

  Unit: microseconds
             expr      min       lq    median       uq       max neval
         KnownMax  106.556  114.692  122.4915  129.406  6427.521    50
       UnknownMax  114.470  122.561  128.9780  136.384   158.346    50
    DT.withAssign 3000.777 3099.729 3198.8175 3291.284 10415.315    50
 DT.withOutAssign 2637.023 2739.930 2814.0585 2903.904  9376.747    50
    lapply.Dickoa 7031.791 7315.781 7438.6835 7634.647 14314.687    50
  apply.SimonO101  430.350  465.074  487.9505  522.938  7568.442    50
   forLoop.Ananda   81.415   91.027   99.7530  104.588   265.394    50

Using a slightly larger data.frame (below), where the result is 1000 x 100. lapply.Dickoa is excluded because my edit might have slowed it down, and as it stood it crashed.

   Unit: milliseconds
             expr      min       lq   median        uq       max neval
         KnownMax 34.83210 35.59068 36.13330  38.15960  52.27746    50
       UnknownMax 36.41766 37.17553 38.03075  47.71438  55.57009    50
    DT.withAssign 31.95005 32.65798 33.73578  43.71493  50.05831    50
 DT.withOutAssign 31.36063 32.08138 32.80728  35.32660  51.00037    50
  apply.SimonO101 78.61677 91.72505 95.53592 103.36052 163.14346    50
   forLoop.Ananda 13.61827 14.02197 14.18899  14.58777  26.42266    50

An even larger set, where the result is 10,000 x 600:

Unit: milliseconds
             expr       min        lq    median        uq       max neval
         KnownMax 1583.5902 1631.6214 1658.6168 1724.9557 1902.3923    50
       UnknownMax 1597.1215 1655.9634 1690.7550 1735.5913 1804.2156    50
    DT.withAssign  586.4675  641.7206  660.7330  716.0100 1193.4806    50
 DT.withOutAssign  587.0492  628.3731  666.3148  717.5575  776.2671    50
  apply.SimonO101 1916.6589 1995.2851 2044.9553 2079.6754 2385.1028    50
   forLoop.Ananda  163.4549  172.5627  182.6207  211.9153  315.0706    50

Using the following:

library(microbenchmark)
library(data.table)

KnownMax <- quote(t(sapply(strsplit(DF$B, ","), `%in%`, x=1:maximum)) + 0)
UnknownMax <- quote({    splat <- strsplit(DF$B, ","); maximum <- max(as.numeric(unlist(splat))); t(sapply(splat, `%in%`, x=1:maximum)) + 0})
DT.withAssign <- quote({DT <- data.table(DF); DT.long <- DT[,  list(vals=as.numeric(unlist(strsplit(B, ",")))), by=A]; cols <- DT.long[, max(vals)]; rows <- DT.long[, max(A)] ; matrix(as.numeric(DT.long[, (1:cols) %in% vals, by=A]$V1), ncol=cols, byrow=TRUE, dimnames=list(seq(rows), seq(cols)))})
DT.withOutAssign <- quote({DT.long <- DT[,  list(vals=as.numeric(unlist(strsplit(B, ",")))), by=A]; cols <- DT.long[, max(vals)]; rows <- DT.long[, max(A)] ; matrix(as.numeric(DT.long[, (1:cols) %in% vals, by=A]$V1), ncol=cols, byrow=TRUE, dimnames=list(seq(rows), seq(cols)))})
lapply.Dickoa <- quote({ tmp <- strsplit(DF$B, ","); label <- 1:max(as.numeric(unlist(tmp))); tmp <- lapply(tmp, function(x) as.data.frame(lapply(label, function(y) (x == y)))); unname(t(sapply(tmp, colSums))) })
apply.SimonO101 <- quote({cols <- 1:max( as.numeric( unlist(strsplit(DF$B,","))));  t(apply(DF["B"] , 1 , function(x) ifelse( cols %in% as.numeric( unlist( strsplit(x , ",") ) ) , 1 , 0 ) ) ) })
forLoop.Ananda <- quote({b = strsplit(DF$B, ","); ncol = max(as.numeric(unlist(b))); temp = lapply(b, as.numeric); m = matrix(0, nrow = nrow(DF), ncol = ncol)      ; for (i in 1:nrow(DF)) {  m[i, temp[[i]]] = 1 }; m })

# slightly modified @Dickoa's algorithm to allow for instances where B is only a single number.
#  Instead of `sapply(.)`, I used `as.data.frame(lapply(.))`; hopefully the simplification step in sapply is comparable in time to `as.data.frame`

identical(eval(lapply.Dickoa), eval(UnknownMax))
identical(eval(lapply.Dickoa), unname(eval(apply.SimonO101)))
identical(eval(lapply.Dickoa), eval(KnownMax))
identical(unname(as.matrix(eval(DT.withAssign))), eval(KnownMax))
# ALL TRUE

This is what was used to create the sample data:

# larger data created as follows
set.seed(1)
maximum <- 600
rows <- 10000
DF <- data.frame(A=seq(rows), B=as.character(c(maximum, replicate(rows-1, paste(sample(maximum, sample(20), FALSE), collapse=",")))), stringsAsFactors=FALSE)
DT <- data.table(DF); 
DT
Answered 2013-04-29T03:26:34.440

One way you could do this with ifelse and strsplit (unless I misunderstood and you don't want to use them?) is like this...

cols <- 1:max( as.numeric( unlist(strsplit(DF$B,","))))
df <- t(apply(DF["B"] , 1 , function(x) ifelse( cols %in% as.numeric( unlist( strsplit(x , ",") ) ) , 1 , 0 ) ) )

colnames(df) <- cols
df
#  1 2 3 4 5 6 7 8 9 10
#1 1 1 1 0 0 0 0 0 0  0
#2 1 1 1 0 0 1 0 0 0  0
#3 1 1 1 0 1 0 1 0 0  0
#4 0 1 1 1 1 1 1 0 0  0
#5 0 0 1 1 1 1 1 0 0  1

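The idea is that we collect all the values in the column you want to split, find the maximum, and create the vector 1:max(value). We then apply over each row to find which elements of that vector appear in the row. ifelse gives 1 where a value is present and 0 where it is not. Because the vector we match against is a sequence, the output columns are already sorted.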
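
As a small illustration of the step applied to each row, here is the same expression evaluated for row 4 of the example data. It also compares ifelse with a plain numeric coercion of the logical vector, which is offered here only as an assumption about an equivalent shortcut, not as part of the original answer.

```r
cols <- 1:10            # 1:max(value) for the example data
x <- "3,7,4,2,6,5"      # row 4 of DF$B

# Which of 1..10 occur in this row?
present <- cols %in% as.numeric(unlist(strsplit(x, ",")))

row4 <- ifelse(present, 1, 0)
row4

# Coercing the logical vector gives the same 0/1 values without ifelse
identical(row4, as.numeric(present))
```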

Answered 2013-04-28T20:46:39.607

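A little late to the game, but a different strategy uses the fact that a matrix can be indexed by another two-column matrix that specifies the row and column indexes to update. So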

f2 <- function(DF) {
    b <- strsplit(DF$B, ",", fixed=TRUE)
    len <- vapply(b, length, integer(1)) # 'geometry'
    b <- as.integer(unlist(b))

    midx <- matrix(c(rep(seq_len(nrow(DF)), len), b), ncol=2)
    m <- matrix(0L, nrow(DF), max(b))
    m[midx] <- 1L
    m
}

This uses strsplit(..., fixed=TRUE) and vapply for efficiency and type safety, and as.integer, 0L and 1L because we really want integer rather than numeric return values.
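
For readers who have not met matrix indexing by a two-column matrix, here is a toy sketch that is independent of DF. The dimensions and index pairs are made up for illustration.

```r
m <- matrix(0L, nrow = 3, ncol = 4)

# Each row of idx is one (row, column) pair to update
idx <- cbind(c(1L, 1L, 3L),   # row indexes
             c(2L, 4L, 1L))   # column indexes

m[idx] <- 1L   # sets m[1, 2], m[1, 4] and m[3, 1] in one vectorised step
m
```

In f2 the first column of midx repeats each row number once per value in that row, and the second column holds the values themselves.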

For comparison, here is the original implementation from @AnandaMahto

f0 <- function(DF) {
    b = strsplit(DF$B, ",")
    ncol = max(as.numeric(unlist(b)))
    temp = lapply(b, as.numeric)
    m = matrix(0, nrow = nrow(DF), ncol = ncol)
    for (i in 1:nrow(DF)) {
        m[i, temp[[i]]] = 1
    }
    m
}

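This can be made more efficient by using fixed=TRUE and avoiding the double coercion of b, and more robust by coercing to integer and using seq_len(nrow(DF)) to avoid the corner case of a 0-row DF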

f1 <- function(DF) {
    b = lapply(strsplit(DF$B, ",", fixed=TRUE), as.integer)
    ncol = max(unlist(b))
    m = matrix(0L, nrow = nrow(DF), ncol = ncol)      
    for (i in seq_len(nrow(DF)))
        m[i, b[[i]]] = 1L
    m
}
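
The 0-row corner case mentioned above can be seen directly by comparing the two ways of building the loop index. This is a minimal sketch; the data.frame empty is hypothetical.

```r
empty <- data.frame(A = numeric(0), B = character(0),
                    stringsAsFactors = FALSE)

bad_idx  <- 1:nrow(empty)          # 1:0 counts down, giving c(1L, 0L)
good_idx <- seq_len(nrow(empty))   # integer(0), so a loop body never runs

bad_idx
good_idx
```

This only covers the loop index. Taking max() over an empty set of values would presumably still need its own guard before f1 could handle a 0-row DF.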

The for loop is a good candidate for compilation, so

library(compiler)
f1c <- cmpfun(f1)

and then, for comparison on the 10,000 x 600 data from @RicardoSaporta

> library(microbenchmark)
> microbenchmark(f0(DF), f1(DF), f1c(DF), f2(DF))
Unit: milliseconds
    expr       min        lq    median        uq      max neval
  f0(DF) 170.51388 180.25997 182.45772 188.23811 717.7511   100
  f1(DF)  91.53578  97.14909  97.97195 100.24236 447.5900   100
 f1c(DF)  79.39194  84.45712  85.71022  87.85763 411.8340   100
  f2(DF)  76.45496  81.70307  82.50752 110.83620 398.6093   100

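Both the 2x speed-up from f0 to f1 and the relative efficiency of the for loop surprised me. @AnandaMahto's solution is more memory efficient, and can be made more so, without too much performance cost, with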

ncol = max(vapply(b, max, integer(1)))
Answered 2013-06-09T16:11:13.697

I know there is already a good and quite efficient answer, but we can use another approach to get the same result.

tmp <- strsplit(DF$B, ",")
label <- 1:max(as.numeric(unlist(tmp)))
tmp <- lapply(tmp, function(x)
              sapply(label, function(y) (x == y)))

t(sapply(tmp, colSums))

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    1    1    1    0    0    0    0    0    0     0
## [2,]    1    1    1    0    0    1    0    0    0     0
## [3,]    1    1    1    0    1    0    1    0    0     0
## [4,]    0    1    1    1    1    1    1    0    0     0
## [5,]    0    0    1    1    1    1    1    0    0     1

We can now benchmark it against @SimonO101's solution (fun2)

require(rbenchmark)

fun1 <- function(DF) {
    tmp <- strsplit(DF$B, ",")
    label <- 1:max(as.numeric(unlist(tmp)))
    tmp <- lapply(tmp, function(x)
                  sapply(label, function(y) (x == y)))
    t(sapply(tmp, colSums))

}


fun2 <- function(DF) {
    cols <- 1:max( as.numeric( unlist(strsplit(DF$B,","))))
    df <- t(apply(DF["B"] , 1 , function(x) ifelse( cols %in% as.numeric( unlist( strsplit(x , ",") ) ) , 1 , 0 ) ) )

    colnames(df) <- cols
    df
}


all.equal(fun1(DF),
          fun2(DF),
          check.attributes = FALSE)

## [1] TRUE


benchmark(fun1(DF),
          fun2(DF),
          order = "elapsed",
          columns = c("test", "elapsed", "relative"),
          replications = 5000)


##       test elapsed relative
## 1 fun1(DF)   1.870    1.000
## 2 fun2(DF)   2.018    1.079

As we can see, there is not much difference.


Suggested edit (RS):

# from: 
tmp <- lapply(tmp, function(x)
           sapply(label, function(y) (x == y)))

#  to: 
tmp <- lapply(tmp, function(x)
          as.data.frame(lapply(label, function(y) (x == y))))
Answered 2013-04-28T21:50:26.577

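OK, this has been bugging me for a while, and I thought it would be a good use of Rcpp. So I wrote a little function to see if I could get something faster than @Ananda's amazing for loop solution. This solution runs about 1.5 times as fast in the benchmark below (using the larger sample dataset posted by @RicardoSaporta).

Note: I attempted this more to teach myself how to use Rcpp and C++ than to provide a useful solution, but all the same...

Our .cpp file...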

#include <Rcpp.h>
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix expandR(CharacterVector x) {
    int n = x.size();
    std::vector< std::vector<int> > out;    // one vector of integers per input string
    int tmax = 0;                           // largest value seen so far
    for (int i = 0; i < n; ++i) {
        std::vector<int> vect;              // integers parsed from the i-th string
        std::string str = as<std::string>(x[i]);
        std::stringstream ss(str);
        int j = 0;
        while (ss >> j) {
            vect.push_back(j);              // add integer to result vector
            if (ss.peek() == ',')           // split by ',' delim
                ss.ignore();
        }
        if (!vect.empty()) {                // max_element on an empty range is undefined
            int it = *std::max_element(vect.begin(), vect.end());
            if (it > tmax)
                tmax = it;                  // current max value
        }
        out.push_back(vect);
    }
    // Now we construct the matrix. tmax gives us number of columns, n is number of rows.
    NumericMatrix mat(n, tmax);
    for (int i = 0; i < n; ++i) {
        NumericMatrix::Row zzrow = mat(i, _);
        const std::vector<int>& vec = out[i];
        for (std::size_t j = 0; j < vec.size(); ++j) {
            zzrow[vec[j] - 1] = 1;          // don't forget R vs. C++ indexing
        }
    }
    return mat;
}

Using the nominal example from the OP, we can then just do...

require(Rcpp)

##  source the function so it is available to use in R
sourceCpp("C:/path/to/file.cpp")

#  Call it like any other R function
expandR(DF$B)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1    1    1    0    0    0    0    0    0     0
[2,]    1    1    1    0    0    1    0    0    0     0
[3,]    1    1    1    0    1    0    1    0    0     0
[4,]    0    1    1    1    1    1    1    0    0     0
[5,]    0    0    1    1    1    1    1    0    0     1

And using the larger dataset provided by @Ricardo, compared with @Ananda's solution...

require(Rcpp)
require(data.table)
set.seed(1)
maximum <- 600
rows <- 10000
DF <- data.frame(A=seq(rows), B=as.character(c(maximum, replicate(rows-1, paste(sample(maximum, sample(20), FALSE), collapse=",")))), stringsAsFactors=FALSE)
DT <- data.table(DF); 
DT

##  source in our c code
sourceCpp("C:/Users/sohanlon/Desktop/expandR2.cpp")

forLoop.Ananda  <- quote({b = strsplit(DF$B, ","); ncol = max(as.numeric(unlist(b))); temp = lapply(b, as.numeric); m = matrix(0, nrow = nrow(DF), ncol = ncol)      ; for (i in 1:nrow(DF)) {  m[i, temp[[i]]] = 1 }; m })
rcpp.Simon      <- quote({mm = expandR( DT$B )})

require(microbenchmark)
microbenchmark( eval(forLoop.Ananda) , eval(rcpp.Simon) , times = 5L )
Unit: milliseconds
                 expr      min       lq   median       uq      max neval
 eval(forLoop.Ananda) 173.3024 178.6445 181.5881 218.9619 227.9490     5
     eval(rcpp.Simon) 115.8309 116.3876 116.8125 119.1971 125.6504     5
Answered 2013-08-13T10:30:13.887

This is not a particularly fast solution; however, it could be useful for those who prefer the tidyverse:

library(dplyr)
library(tidyr)
library(stringr)

DF %>%
 mutate(B = str_split(B, fixed(","))) %>%
 unnest(B) %>%
 transmute(A,
           var = as.numeric(B),
           val = 1) %>%
 complete(var = seq(min(var), max(var), 1), nesting(A)) %>%
 spread(var, val, fill = 0)

      A   `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8`   `9`  `10`
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     1     1     1     0     0     0     0     0     0     0
2     2     1     1     1     0     0     1     0     0     0     0
3     3     1     1     1     0     1     0     1     0     0     0
4     4     0     1     1     1     1     1     1     0     0     0
5     5     0     0     1     1     1     1     1     0     0     1

To have more compact column names:

DF %>%
 mutate(B = str_split(B, fixed(","))) %>%
 unnest(B) %>%
 transmute(A,
           var = as.numeric(B),
           val = 1) %>%
 complete(var = seq(min(var), max(var), 1), nesting(A)) %>%
 spread(var, val, fill = 0) %>%
 rename_at(2:length(.), ~ paste0("Col", 1:length(.)))

      A  Col1  Col2  Col3  Col4  Col5  Col6  Col7  Col8  Col9 Col10
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     1     1     1     0     0     0     0     0     0     0
2     2     1     1     1     0     0     1     0     0     0     0
3     3     1     1     1     0     1     0     1     0     0     0
4     4     0     1     1     1     1     1     1     0     0     0
5     5     0     0     1     1     1     1     1     0     0     1
Answered 2019-07-19T08:49:51.483