Edit: The question seems to have changed drastically while I was answering, but I am leaving this here anyway.
I am assuming that @joran's comment is correct and that you meant the following (with the order in lev fixed):
lev <- c("alpha", "bravo", "charlie", "delta", "echo", "foxtrot")
A <- factor(sample(lev, 6000, TRUE), levels=lev)
B <- factor(sample(lev, 6000, TRUE), levels=lev)
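Also, mapping is not a two-dimensional array (matrix) or a nested data structure (list of lists), as you seem to think it is. It is a flat named character vector: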
> mapping
alpha.alpha alpha.bravo alpha.charlie alpha.delta alpha.echo
"green" "blue" "blue" "red" "red"
alpha.foxtrot bravo.alpha bravo.bravo bravo.charlie bravo.delta
"red" "blue" "green" "blue" "red"
bravo.echo bravo.foxtrot charlie.alpha charlie.bravo charlie.charlie
"red" "red" "blue" "blue" "green"
charlie.delta charlie.echo charlie.foxtrot delta.alpha delta.bravo
"red" "red" "red" "red" "red"
delta.charlie delta.delta delta.echo delta.foxtrot echo.alpha
"red" "green" "yellow" "red" "red"
echo.bravo echo.charlie echo.delta echo.echo echo.foxtrot
"red" "red" "yellow" "red" "red"
foxtrot.alpha foxtrot.bravo foxtrot.charlie foxtrot.delta foxtrot.echo
"red" "red" "red" "red" "red"
foxtrot.foxtrot
"green"
Now, if you want to store it as a list of lists:
mapping <- list(
"alpha" = list("alpha"="green", "bravo"="blue", "charlie"="blue",
"delta"="red", "echo"="red", "foxtrot"="red"),
"bravo" = list("alpha"="blue", "bravo"="green", "charlie"="blue",
"delta"="red", "echo"="red", "foxtrot"="red"),
"charlie" = list("alpha"="blue", "bravo"="blue", "charlie"="green",
"delta"="red", "echo"="red", "foxtrot"="red"),
"delta" = list("alpha"="red", "bravo"="red", "charlie"="red",
"delta"="green", "echo"="yellow", "foxtrot"="red"),
"echo" = list("alpha"="red", "bravo"="red", "charlie"="red",
"delta"="yellow", "echo"="red", "foxtrot"="red"),
"foxtrot" = list("alpha"="red", "bravo"="red", "charlie"="red",
"delta"="red", "echo"="red", "foxtrot"="green")
)
mapper <- function(X, Y) mapping[[levels(X)[X]]][[levels(Y)[Y]]]
Note that I used list rather than c when creating mapping, and that mapper uses the extractor ( [[ ) rather than the subset ( [ ) notation.
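As a rough illustration of why the extractor matters here, consider a made-up miniature of the nested list (the names m, inner and wrapped are only for this sketch). The subset operator keeps the list wrapper, so a second lookup by the inner name has nothing to find, while the extractor returns the inner list itself.

```r
# Hypothetical miniature of the nested mapping.
m <- list(alpha = list(alpha = "green", bravo = "blue"))

inner   <- m[["alpha"]]  # the inner list itself
wrapped <- m["alpha"]    # a length-1 list that still wraps the inner list

names(inner)    # "alpha" "bravo"
names(wrapped)  # "alpha": there is no "bravo" element at this level

m[["alpha"]][["bravo"]]  # "blue", a plain character value
```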
Check that this works for a single value:
> mapper(A[1], B[1])
[1] "red"
And for just a couple of values:
> mapper(A[1:2], B[1:2])
Error in mapping[[levels(X)[X]]][[levels(Y)[Y]]] :
attempt to select more than one element
So we see that mapper is not vectorized (as it must be). From the help page for outer:
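FUN is called with these two extended vectors as arguments. Therefore, it must be a vectorized function (or the name of one), expecting at least two arguments.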
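A quick way to convince yourself of this is to count the calls. The sketch below uses a hypothetical helper f (not part of the question) that records how often it is called and how long its first argument is. With inputs of length 2 and 3, outer calls it once with two vectors of length 6 rather than six times with scalars.

```r
# Hypothetical bookkeeping variables, just for this demonstration.
calls    <- 0
seen_len <- NA

f <- function(x, y) {
  calls    <<- calls + 1   # how many times outer() invoked FUN
  seen_len <<- length(x)   # length of the expanded first argument
  x + y                    # vectorized, so outer() is satisfied
}

res <- outer(1:2, 1:3, FUN = f)

calls     # 1
seen_len  # 6
dim(res)  # 2 3
```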
A simple, but not necessarily efficient, way to vectorize it:
> Vectorize(mapper)(A[1:2], B[1:2])
[1] "red" "green"
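For context, Vectorize builds a wrapper around mapply, so the original function is still called once per pair of elements. The sketch below shows that equivalence on a made-up scalar-only function (cmp is a hypothetical name, not from the question).

```r
# Hypothetical function that only works on single values,
# because if() needs a length-1 condition.
cmp <- function(x, y) if (x > y) "gt" else "le"

vcmp <- Vectorize(cmp)

vcmp(1:3, c(2, 2, 2))         # "le" "le" "gt"
mapply(cmp, 1:3, c(2, 2, 2))  # the same element-by-element loop
```

That per-element call overhead is what shows up once the inputs get large.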
This now works on a subset:
> outer(A[1:6], B[1:6], FUN=Vectorize(mapper))
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "red" "yellow" "red" "red" "red" "red"
[2,] "red" "green" "red" "red" "red" "yellow"
[3,] "red" "green" "red" "red" "red" "yellow"
[4,] "blue" "red" "blue" "red" "blue" "red"
[5,] "green" "red" "green" "red" "green" "red"
[6,] "red" "red" "red" "green" "red" "red"
Let's check the timing:
> system.time(outer(A[1:6], B[1:6], FUN=Vectorize(mapper)))
user system elapsed
0 0 0
> system.time(outer(A[1:60], B[1:60], FUN=Vectorize(mapper)))
user system elapsed
0.22 0.00 0.22
> system.time(outer(A[1:600], B[1:600], FUN=Vectorize(mapper)))
user system elapsed
23.97 0.01 24.01
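This looks linear in the size of the outer product, that is, quadratic in the length of A or B: going from 60 to 600 elements per side multiplies the number of cells by 100 and the time by roughly 100 as well. I did not wait the estimated 40 minutes to see whether 6000x6000 would work.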
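The 40-minute figure is a back-of-the-envelope extrapolation from the 600x600 timing, assuming the cost keeps scaling with the number of cells in the result:

```r
t600   <- 23.97            # elapsed-ish seconds measured for 600 x 600
growth <- (6000 / 600)^2   # 100 times as many cells for 6000 x 6000

est_seconds <- t600 * growth   # 2397 seconds
est_minutes <- est_seconds / 60
est_minutes                    # just under 40 minutes
```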
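Can we be more efficient? Doubly indexing a recursive structure (and then having to use Vectorize on top of that) is not that efficient. Let's use a different data structure, a two-dimensional array (matrix), and use matrix-based indexing.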
mapping <- matrix(c("green", "blue", "blue", "red", "red", "red",
"blue", "green", "blue", "red", "red", "red",
"blue", "blue", "green", "red", "red", "red",
"red", "red", "red", "green", "yellow", "red",
"red", "red", "red", "yellow", "red", "red",
"red", "red", "red", "red", "red", "green"),
nrow = 6, ncol = 6,
dimnames = list(lev, lev))
mapper <- function(X, Y) mapping[cbind(as.character(X), as.character(Y))]
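If matrix-based indexing is unfamiliar: indexing a matrix with a two-column matrix picks one cell per row of the index, and when the index is a character matrix the entries are matched against the dimnames. A minimal sketch on a made-up 2x2 matrix (m and idx are names used only here):

```r
# Hypothetical small matrix, filled column by column.
m <- matrix(1:4, nrow = 2,
            dimnames = list(c("a", "b"), c("x", "y")))

# Each row of idx is one (row name, column name) pair.
idx <- cbind(c("a", "b"), c("y", "x"))

m[idx]  # 3 2, i.e. m["a", "y"] and m["b", "x"]
```

Because the whole lookup happens in a single indexing call, the function is vectorized as is and needs no Vectorize wrapper.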
And test this:
> A[1:6]
[1] echo delta delta charlie alpha foxtrot
Levels: alpha bravo charlie echo delta foxtrot
> B[1:6]
[1] alpha delta alpha foxtrot alpha echo
Levels: alpha bravo charlie echo delta foxtrot
> mapper(A[1], B[1])
[1] "red"
> mapper(A[1:2], B[1:2])
[1] "red" "green"
> outer(A[1:6], B[1:6], FUN=mapper)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "red" "yellow" "red" "red" "red" "red"
[2,] "red" "green" "red" "red" "red" "yellow"
[3,] "red" "green" "red" "red" "red" "yellow"
[4,] "blue" "red" "blue" "red" "blue" "red"
[5,] "green" "red" "green" "red" "green" "red"
[6,] "red" "red" "red" "green" "red" "red"
Looks good. Check the timing:
> system.time(outer(A[1:6], B[1:6], FUN=mapper))
user system elapsed
0 0 0
> system.time(outer(A[1:60], B[1:60], FUN=mapper))
user system elapsed
0 0 0
> system.time(outer(A[1:600], B[1:600], FUN=mapper))
user system elapsed
0.22 0.00 0.22
> system.time(outer(A, B, FUN=mapper))
user system elapsed
7.80 1.48 9.30
That is a bit over 9 seconds rather than the estimated ~40 minutes, a speed-up of about 250 times.