4

我有一组包含空格分隔元素的字符串。我想构建一个矩阵,它会告诉我哪些元素是哪些字符串的一部分。例如:

""
"A B C"
"D"
"B D"

应该给出如下内容:

  A B C D
1
2 1 1 1
3       1
4   1   1

现在我有了一个解决方案,但它像糖蜜一样运行缓慢,而且我已经没有关于如何让它更快的想法:

reverseIn <- function(vector, value) {
    return(value %in% vector)
}

buildCategoryMatrix <- function(valueVector) {
    allClasses <- c()
    for(classVec in unique(valueVector)) {
        allClasses <- unique(c(allClasses,
                               strsplit(classVec, " ", fixed=TRUE)[[1]]))
    }

    resMatrix <- matrix(ncol=0, nrow=length(valueVector))
    splitValues <- strsplit(valueVector, " ", fixed=TRUE)

    for(cat in allClasses) {
        if(cat=="") {
            catIsPart <- (valueVector == "")
        } else {
            catIsPart <- sapply(splitValues, reverseIn, cat)
        }
        resMatrix <- cbind(resMatrix, catIsPart)
    }
    colnames(resMatrix) <- allClasses

    return(resMatrix)
}

分析函数给了我这个:

$by.self
                  self.time self.pct total.time total.pct
"match"               31.20    34.74      31.24     34.79
"FUN"                 30.26    33.70      74.30     82.74
"lapply"              13.56    15.10      87.86     97.84
"%in%"                12.92    14.39      44.10     49.11

所以我的实际问题是: - 花在“乐趣”上的 33% 来自哪里?- 有什么方法可以加快 %in% 的通话速度吗?

我尝试在进入循环之前将字符串转换为因子,以便匹配数字而不是字符串,但这实际上会使 R 崩溃。我还尝试进行部分矩阵分配(IE,resMatrix[i,x] <- 1),其中 i 是字符串的编号,x 是因子的向量。那里也没有骰子,因为它似乎一直在无限运行。

4

3 回答 3

5

在我的“splitstackshape”包的开发版本中,有一个名为的辅助函数charBinaryMat可以用于这样的事情:

这是函数(因为 CRAN 上的包版本还没有):

charBinaryMat <- function(listOfValues, fill = NA) {
  lev <- sort(unique(unlist(listOfValues, use.names = FALSE)))
  m <- matrix(fill, nrow = length(listOfValues), ncol = length(lev))
  colnames(m) <- lev
  for (i in 1:nrow(m)) {
    m[i, listOfValues[[i]]] <- 1
  }
  m
}

输入应为 的输出strsplit

在这里它正在使用中:

str <- c("" , "A B C" , "D" , "B D" )

## Fill is `NA` by default
charBinaryMat(strsplit(str, " ", fixed=TRUE))
#       A  B  C  D
# [1,] NA NA NA NA
# [2,]  1  1  1 NA
# [3,] NA NA NA  1
# [4,] NA  1 NA  1

## Can easily be set to another value
charBinaryMat(strsplit(str, " ", fixed=TRUE), fill = 0)
#      A B C D
# [1,] 0 0 0 0
# [2,] 1 1 1 0
# [3,] 0 0 0 1
# [4,] 0 1 0 1

基准测试

由于您的问题是关于更快的方法,让我们进行基准测试。

  1. 基准测试的功能:

    CBM <- function() {
      charBinaryMat(strsplit(str, " ", fixed=TRUE), fill = 0)
    }
    BCM <- function() {
      buildCategoryMatrix(str)*1L
    }
    Sapply <- function() {
      y <- unique( unlist( strsplit( str , " " ) ) )
      out <- t(sapply(str, function(x) y %in% unlist(strsplit(x , " " )),
                      USE.NAMES = FALSE )) * 1L
      colnames(out) <- y
      out
    }
    
  2. 一些样本数据:

    set.seed(1)
    A = sample(10, 100000, replace = TRUE)
    str <- sapply(seq_along(A), function(x)
      paste(sample(LETTERS[1:10], A[x]), collapse = " "))
    head(str)
    # [1] "H G C"               "F H J G"             "H D J A I B"        
    # [4] "A C F H J B E G D I" "F C H"               "I C G B J D F A E" 
    
  3. 一些示例输出:

    ## Automatically sorted
    head(CBM())
    #      A B C D E F G H I J
    # [1,] 0 0 1 0 0 0 1 1 0 0
    # [2,] 0 0 0 0 0 1 1 1 0 1
    # [3,] 1 1 0 1 0 0 0 1 1 1
    # [4,] 1 1 1 1 1 1 1 1 1 1
    # [5,] 0 0 1 0 0 1 0 1 0 0
    # [6,] 1 1 1 1 1 1 1 0 1 1
    
    ## Sorting just for comparison
    head(BCM())[, LETTERS[1:10]]
    #      A B C D E F G H I J
    # [1,] 0 0 1 0 0 0 1 1 0 0
    # [2,] 0 0 0 0 0 1 1 1 0 1
    # [3,] 1 1 0 1 0 0 0 1 1 1
    # [4,] 1 1 1 1 1 1 1 1 1 1
    # [5,] 0 0 1 0 0 1 0 1 0 0
    # [6,] 1 1 1 1 1 1 1 0 1 1
    
    ## Sorting just for comparison
    head(Sapply())[, LETTERS[1:10]]
    #      A B C D E F G H I J
    # [1,] 0 0 1 0 0 0 1 1 0 0
    # [2,] 0 0 0 0 0 1 1 1 0 1
    # [3,] 1 1 0 1 0 0 0 1 1 1
    # [4,] 1 1 1 1 1 1 1 1 1 1
    # [5,] 0 0 1 0 0 1 0 1 0 0
    # [6,] 1 1 1 1 1 1 1 0 1 1
    
  4. 基准测试:

    library(microbenchmark)
    microbenchmark(CBM(), BCM(), Sapply(), times=20)
    # Unit: milliseconds
    #      expr        min         lq     median         uq        max neval
    #     CBM()   675.0929   718.3454   777.2423   805.3872   858.6609    20
    #     BCM() 11059.6305 11267.9888 11367.3283 11595.1758 11792.5950    20
    #  Sapply()  3536.7755  3687.0308  3759.7388  3813.4233  3968.3192    20
    
于 2013-10-25T16:18:53.667 回答
4

这很容易做到vapply

x <- c("" , "A B C" , "D" , "B D" )
lines <- strsplit(x, " ", fixed = TRUE)

all <- sort(unique(unlist(lines)))

t(vapply(lines, function(x) all %in% x, numeric(length(all))))  

这比@Ananda 的方法慢一点:https ://gist.github.com/hadley/7169138

于 2013-10-26T12:52:35.337 回答
2

Here's one way of doing this. There is a lot going on in the line where out is assigned. Basically, we loop over each element of your input vector. We split each element into individual characters, then we look to see which of these is present in a vector of all the unique values in your dataset. This returns either TRUE or FALSE. We use * 1L at the end to turn logical values into integer but you could just wrap the whole thing in as.integer instead. sapply returns the results column-wise but you want them row-wise so we use the transpose function t() to achieve this.

The final line converts to a data.frame and applies column names.

#  Data
str <- c("" , "A B C" , "D" , "B D" )

#  Unique column headers (excluding empty strings as in example)
y <- unique( unlist( strsplit( str , " " ) ) )

#  Results
out <- t( sapply( str , function(x) y %in% unlist( strsplit( x , " " ) ) , USE.NAMES = FALSE ) ) * 1L

#  Combine to a data.frame
setNames( data.frame( out ) , y )
#  A B C D
#1 0 0 0 0
#2 1 1 1 0
#3 0 0 0 1
#4 0 1 0 1
于 2013-10-25T16:10:23.640 回答