1

给定两个因素(每个因素都有相同的水平),比如说

lev <- c("alpha", "bravo", "charlie", "echo", "delta", "foxtrot")
A <- factor(sample(lev, 6000, TRUE))
B <- factor(sample(lev, 6000, TRUE))

我想使用他们的外部产品,具有自定义产品功能,定义如下:

mapping <- matrix(c(
    "green", "blue",  "blue",  "red",    "red",    "red",
    "blue",  "green", "blue",  "red",    "red",    "red",
    "blue",  "blue",  "green", "red",    "red",    "red",
    "red",   "red",   "red",   "green",  "yellow", "red",
    "red",   "red",   "red",   "yellow", "green",  "red",
    "red",   "red",   "red",   "red",    "red",    "green"),
    nrow=6, ncol=6,
    dimnames=list(lev, lev))
mapper <- function (X, Y) mapping[matrix(c(levels(X)[X], levels(Y)[Y]),
                                         ncol=2, byrow=TRUE)]
A.B <- outer(A, B, FUN=mapper)

预期输出(对于大大减少的测试用例)应该是这样的

> A
[1] alpha   foxtrot echo    charlie echo    foxtrot bravo   delta   charlie
Levels: alpha bravo charlie delta echo foxtrot
> B
[1] alpha   foxtrot delta   bravo   bravo   alpha   alpha   bravo   alpha  
Levels: alpha bravo delta foxtrot
> outer(A, B, mapper)
      [,1]   [,2]   [,3]   [,4]   [,5]    [,6]    [,7]    [,8]    [,9]   
 [1,] "red"  "red"  "red"  "red"  "red"   "green" "green" "green" "green"
 [2,] "red"  "red"  "red"  "red"  "red"   "green" "green" "green" "green"
 [3,] "red"  "red"  "red"  "red"  "red"   "green" "green" "green" "green"
 [4,] "red"  "red"  "red"  "red"  "red"   "green" "green" "green" "green"
 [5,] "blue" "blue" "blue" "blue" "blue"  "red"   "green" "green" "blue" 
 [6,] "red"  "red"  "red"  "red"  "green" "green" "green" "green" "green"
 [7,] "red"  "red"  "red"  "red"  "green" "green" "green" "green" "green"
 [8,] "red"  "red"  "red"  "red"  "green" "green" "green" "green" "green"
 [9,] "red"  "red"  "red"  "red"  "green" "green" "green" "green" "green"

这行得通,但在全范围内它慢得令人不快:

> system.time(outer(A, B, mapper))
   user  system elapsed 
 11.381   5.015  17.653 

谁能推荐一个更快的方法?如果有帮助,映射矩阵保证是三角形的(即mapping[a,b] == mapping[b,a]∀ a,b。)

4

3 回答 3

4

编辑:在我回答时,问题似乎发生了巨大变化,但无论如何都要留在这里。


我假设@joran 的评论是正确的,而您的意思是(并在 中修复顺序lev

lev <- c("alpha", "bravo", "charlie", "delta", "echo", "foxtrot")
A <- factor(sample(lev, 6000, TRUE), levels=lev)
B <- factor(sample(lev, 6000, TRUE), levels=lev)

此外,mapping不是您似乎认为的二维数组(矩阵)或嵌套数据结构(列表列表)

> mapping
    alpha.alpha     alpha.bravo   alpha.charlie     alpha.delta      alpha.echo 
        "green"          "blue"          "blue"           "red"           "red" 
  alpha.foxtrot     bravo.alpha     bravo.bravo   bravo.charlie     bravo.delta 
          "red"          "blue"         "green"          "blue"           "red" 
     bravo.echo   bravo.foxtrot   charlie.alpha   charlie.bravo charlie.charlie 
          "red"           "red"          "blue"          "blue"         "green" 
  charlie.delta    charlie.echo charlie.foxtrot     delta.alpha     delta.bravo 
          "red"           "red"           "red"           "red"           "red" 
  delta.charlie     delta.delta      delta.echo   delta.foxtrot      echo.alpha 
          "red"         "green"        "yellow"           "red"           "red" 
     echo.bravo    echo.charlie      echo.delta       echo.echo    echo.foxtrot 
          "red"           "red"        "yellow"           "red"           "red" 
  foxtrot.alpha   foxtrot.bravo foxtrot.charlie   foxtrot.delta    foxtrot.echo 
          "red"           "red"           "red"           "red"           "red" 
foxtrot.foxtrot 
        "green" 

现在,如果您想将其存储为列表列表:

mapping <- list(
    "alpha"   = list("alpha"="green", "bravo"="blue", "charlie"="blue",
                     "delta"="red", "echo"="red", "foxtrot"="red"),
    "bravo"   = list("alpha"="blue", "bravo"="green", "charlie"="blue",
                     "delta"="red", "echo"="red", "foxtrot"="red"),
    "charlie" = list("alpha"="blue", "bravo"="blue", "charlie"="green",
                     "delta"="red", "echo"="red", "foxtrot"="red"),
    "delta"   = list("alpha"="red", "bravo"="red", "charlie"="red",
                     "delta"="green", "echo"="yellow", "foxtrot"="red"),
    "echo"    = list("alpha"="red", "bravo"="red", "charlie"="red",
                     "delta"="yellow", "echo"="red", "foxtrot"="red"),
    "foxtrot" = list("alpha"="red", "bravo"="red", "charlie"="red",
                     "delta"="red", "echo"="red", "foxtrot"="green")
)
mapper = function(X, Y) mapping[[levels(X)[X]]][[levels(Y)[Y]]]

请注意,我使用list而不是c在创建mappingmapper使用提取器 ( [[) 而不是子集 ( [) 表示法。

检查这适用于单个值:

> mapper(A[1], B[1])
[1] "red"

并且只有几个值:

> mapper(A[1:2], B[1:2])
Error in mapping[[levels(X)[X]]][[levels(Y)[Y]]] : 
  attempt to select more than one element

所以我们看到mapper不是矢量化的(因为它必须是)。从帮助页面outer

FUN以这两个扩展向量作为参数调用。因此,它必须是一个向量化函数(或一个函数的名称),至少需要两个参数。

将其矢量化的简单但不一定有效的方法:

> Vectorize(mapper)(A[1:2], B[1:2])
[1] "red"   "green"

这现在适用于一个子集:

> outer(A[1:6], B[1:6], FUN=Vectorize(mapper))
     [,1]    [,2]     [,3]    [,4]    [,5]    [,6]    
[1,] "red"   "yellow" "red"   "red"   "red"   "red"   
[2,] "red"   "green"  "red"   "red"   "red"   "yellow"
[3,] "red"   "green"  "red"   "red"   "red"   "yellow"
[4,] "blue"  "red"    "blue"  "red"   "blue"  "red"   
[5,] "green" "red"    "green" "red"   "green" "red"   
[6,] "red"   "red"    "red"   "green" "red"   "red"   

让我们检查一下时间:

> system.time(outer(A[1:6], B[1:6], FUN=Vectorize(mapper)))
   user  system elapsed 
      0       0       0 
> system.time(outer(A[1:60], B[1:60], FUN=Vectorize(mapper)))
   user  system elapsed 
   0.22    0.00    0.22 
> system.time(outer(A[1:600], B[1:600], FUN=Vectorize(mapper)))
   user  system elapsed 
  23.97    0.01   24.01 

看起来在外部产品的长度上是线性的,或者在 A 或 B 的长度上是二次的。我没有等 40 分钟来看看 6000x6000 是否可以工作。

我们可以提高效率吗?对递归结构进行双重索引(然后必须Vectorize在此之上使用)并不是那么有效。让我们使用不同的数据结构:二维数组(矩阵)并使用基于矩阵的索引。

mapping <- matrix(c("green", "blue", "blue", "red", "red", "red", 
                    "blue", "green", "blue", "red", "red", "red", 
                    "blue", "blue", "green", "red", "red", "red",
                    "red", "red", "red", "green", "yellow", "red", 
                    "red", "red", "red", "yellow", "red", "red", 
                    "red", "red", "red", "red", "red", "green"),
                  nrow = 6, ncol = 6,
                  dimnames = list(lev, lev))
mapper <- function(X, Y) mapping[cbind(as.character(X), as.character(Y))]

并测试这个

> A[1:6]
[1] echo    delta   delta   charlie alpha   foxtrot
Levels: alpha bravo charlie echo delta foxtrot
> B[1:6]
[1] alpha   delta   alpha   foxtrot alpha   echo   
Levels: alpha bravo charlie echo delta foxtrot
> mapper(A[1], B[1])
[1] "red"
> mapper(A[1:2], B[1:2])
[1] "red"   "green"
> outer(A[1:6], B[1:6], FUN=mapper)
     [,1]    [,2]     [,3]    [,4]    [,5]    [,6]    
[1,] "red"   "yellow" "red"   "red"   "red"   "red"   
[2,] "red"   "green"  "red"   "red"   "red"   "yellow"
[3,] "red"   "green"  "red"   "red"   "red"   "yellow"
[4,] "blue"  "red"    "blue"  "red"   "blue"  "red"   
[5,] "green" "red"    "green" "red"   "green" "red"   
[6,] "red"   "red"    "red"   "green" "red"   "red"   

看起来不错。检查时间:

> system.time(outer(A[1:6], B[1:6], FUN=mapper))
   user  system elapsed 
      0       0       0 
> system.time(outer(A[1:60], B[1:60], FUN=mapper))
   user  system elapsed 
      0       0       0 
> system.time(outer(A[1:600], B[1:600], FUN=mapper))
   user  system elapsed 
   0.22    0.00    0.22 
> system.time(outer(A, B, FUN=mapper))
   user  system elapsed 
   7.80    1.48    9.30 

大约 250 倍的加速时间比 9 秒多一点,而不是约 40 分钟。

于 2014-07-03T22:14:59.663 回答
2

所以,你的mapping变量不太正确。如果你看

str(mapping)
#  Named chr [1:36] "green" "blue" "blue" "red" "red" "red" ...
#  - attr(*, "names")= chr [1:36] "alpha.alpha" "alpha.bravo" "alpha.charlie" ...

你会看到它是一个一维字符向量。在那里,元素的名称用“.”粘贴在一起。我假设这不是你想要的。也许你用过list()而不是c()?但是如果你可以控制格式,为什么不使用一个简单的矩阵

mapping <- structure(c("green", "blue", "blue", "red", "red", "red", "blue", 
"green", "blue", "red", "red", "red", "blue", "blue", "green", 
"red", "red", "red", "red", "red", "red", "green", "yellow", 
"red", "red", "red", "red", "yellow", "red", "red", "red", "red", 
"red", "red", "red", "green"), .Dim = c(6L, 6L), .Dimnames = list(
    c("alpha", "bravo", "charlie", "echo", "delta", "foxtrot"
    ), c("alpha", "bravo", "charlie", "echo", "delta", "foxtrot"
    )))

所以每个值都有一行和一列,lev单元格的颜色是组合的颜色。

那么如果你这样做

#sample data
lev <- c("alpha", "bravo", "charlie", "echo", "delta", "foxtrot")
A <- factor(sample(lev, 6000, TRUE), levels=lev)
B <- factor(sample(lev, 6000, TRUE), levels=lev)

#run mapping
out <- outer(A, B, FUN=function(a,b) mapping[cbind(a,b)])

现在 out 将有沿行的 A 值和沿 cols 的 B 值以及两者之间交互的正确颜色作为单元格值。这运行得很快

system.time(outer(A, B, FUN=function(a,b) mapping[cbind(a,b)]))

#   user  system elapsed 
#   0.90    0.25    1.15 
于 2014-07-03T21:52:59.410 回答
0

我相信这可以在大约 4 秒内完成您想要的操作(如果您不添加名称,运行速度会快 4 倍,但结果看起来不太好)。请注意,非常重要:这仅适用于 和 的级别相同A并且B级别与 的名称相同的情况mapping.mx。这是因为cbind将因子强制转换为数值,所以映射是位置性的。如果不是这种情况,您可以强制转换字符AB它会起作用,但速度会更慢。

names(A) <- A
names(B) <- B

mapping.mx <- do.call(rbind, mapping.lst)  # see below for mapping.lst
system.time(res <- outer(A, B, function(x, y) mapping.mx[cbind(x, y)]))

# user  system elapsed 
# 3.33    0.62    3.95 

str(res)

# chr [1:6000, 1:6000] "red" "green" "green" "blue" "green" "blue" ...
# - attr(*, "dimnames")=List of 2
#  ..$ : chr [1:6000] "delta" "alpha" "alpha" "bravo" ...
#  ..$ : chr [1:6000] "alpha" "alpha" "echo" "delta" ...

res[1:5, 1:5]

#       alpha   alpha   echo     delta charlie
# delta "red"   "red"   "yellow" "red" "red"  
# alpha "green" "green" "red"    "red" "blue" 
# alpha "green" "green" "red"    "red" "blue" 
# bravo "blue"  "blue"  "red"    "red" "blue" 
# alpha "green" "green" "red"    "red" "blue"     

并且mapping.lst(基本上,与您的相同,但将第一个更改c为 a list):

mapping.lst <- list(
  "alpha"   = c("alpha"="green", "bravo"="blue", "charlie"="blue",
                "delta"="red", "echo"="red", "foxtrot"="red"),
  "bravo"   = c("alpha"="blue", "bravo"="green", "charlie"="blue",
                "delta"="red", "echo"="red", "foxtrot"="red"),
  "charlie" = c("alpha"="blue", "bravo"="blue", "charlie"="green",
                "delta"="red", "echo"="red", "foxtrot"="red"),
  "delta"   = c("alpha"="red", "bravo"="red", "charlie"="red",
                "delta"="green", "echo"="yellow", "foxtrot"="red"),
  "echo"    = c("alpha"="red", "bravo"="red", "charlie"="red",
                "delta"="yellow", "echo"="red", "foxtrot"="red"),
  "foxtrot" = c("alpha"="red", "bravo"="red", "charlie"="red",
                "delta"="red", "echo"="red", "foxtrot"="green")
)
于 2014-07-03T21:54:30.820 回答