performance - Column-wise sort or top-n of a matrix

Question

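I need to sort a matrix so that all elements stay in their columns and each column is in ascending order. Is there a vectorized column-wise sort for a matrix or a data frame in R? (My matrix is all-positive and bounded by B, so I can add j*B to each cell in column j and do a regular one-dimensional sort: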

> set.seed(100523); m <- matrix(round(runif(30),2), nrow=6); m
     [,1] [,2] [,3] [,4] [,5]
[1,] 0.47 0.32 0.29 0.54 0.38
[2,] 0.38 0.91 0.76 0.43 0.92
[3,] 0.71 0.32 0.48 0.16 0.85
[4,] 0.88 0.83 0.61 0.95 0.72
[5,] 0.16 0.57 0.70 0.82 0.05
[6,] 0.77 0.03 0.75 0.26 0.05
> offset <- rep(seq_len(5), rep(6, 5)); offset
 [1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5
> m <- matrix(sort(m + offset), nrow=nrow(m)) - offset; m
     [,1] [,2] [,3] [,4] [,5]
[1,] 0.16 0.03 0.29 0.16 0.05
[2,] 0.38 0.32 0.48 0.26 0.05
[3,] 0.47 0.32 0.61 0.43 0.38
[4,] 0.71 0.57 0.70 0.54 0.72
[5,] 0.77 0.83 0.75 0.82 0.85
[6,] 0.88 0.91 0.76 0.95 0.92

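But is there something prettier already built in?) Otherwise, what is the fastest way if my matrix has around 1M (10M, 100M) entries and is roughly square? I am worried about the performance penalty of apply and friends.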

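Actually, I don't need a full sort, just the "top n", with n being around 30 or 100, say. I am thinking about using apply together with the partial argument of sort, but I wonder whether this is cheaper than just doing a vectorized sort. So, before running benchmarks myself, I'd like to ask experienced users for advice.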

score 4 · Accepted Answer

If you want to use sort, ?sort indicates that method = "quick" can be twice as fast as the default method with on the order of 1 million elements.

Start with apply(m, 2, sort, method = "quick") and see if that provides sufficient speed.

Do note the comments on this in ?sort though; ties are sorted in a non-stable manner.
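
A minimal sketch of that suggestion on a small random matrix. The 1000 x 200 size and the variable names are arbitrary choices for illustration, not taken from the question. It only checks that the quick method returns the same column-sorted values as the default; it says nothing about speed:

```r
set.seed(42)
m <- matrix(runif(1000 * 200), nrow = 1000)   # illustrative size only

sorted_default <- apply(m, 2, sort)
sorted_quick   <- apply(m, 2, sort, method = "quick")

# For plain numeric data the sorted values are the same either way
stopifnot(identical(sorted_default, sorted_quick))
# Every column is in ascending order
stopifnot(!any(apply(sorted_quick, 2, is.unsorted)))
```

Wrapping each call in system.time() on a matrix of realistic size is the quickest way to see whether the difference matters for your data.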

score 4 · Accepted Answer

I have put together a quick testing framework for the solutions proposed so far.

library(rbenchmark)

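# SIZE (matrix dimension) and TOP (number of rows to keep) are globals
# set by the benchmark loop below.

# Full sort using the quicksort method
sort.q <- function(m) {
  sort(m, method='quick')
}
# Partial sort: isolate the TOP smallest values, then order just those
sort.p <- function(m) {
  mm <- sort(m, partial=TOP)[1:TOP]
  sort(mm)
}

# Offset trick: add the column index to every cell, sort the whole matrix
# as one vector, then subtract the offset again (assumes values in [0, 1))
sort.all.g <- function(f) {
  function(m) {
    o <- matrix(rep(seq_len(SIZE), rep(SIZE, SIZE)), nrow=SIZE)
    matrix(f(m+o), nrow=SIZE)[1:TOP,]-o[1:TOP,]
  }
}
sort.all <- sort.all.g(sort)
sort.all.q <- sort.all.g(sort.q)

# Column-by-column sort with apply, keeping the first TOP rows
apply.sort.g <- function(f) {
  function(m) {
    apply(m, 2, f)[1:TOP,]
  }
}
apply.sort <- apply.sort.g(sort)
apply.sort.p <- apply.sort.g(sort.p)
apply.sort.q <- apply.sort.g(sort.q)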

bb <- NULL

SIZE_LIMITS <- 3:9
TOP_LIMITS <- 2:5

for (SIZE in floor(sqrt(10)^SIZE_LIMITS)) {
  for (TOP in floor(sqrt(10)^TOP_LIMITS)) {
    print(c(SIZE, TOP))
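    # Cap TOP at SIZE; capped runs share one TOP value, so xtabs() below
    # sums their timings into a single cell and leaves the others at 0
    TOP <- min(TOP, SIZE)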
    m <- matrix(runif(SIZE*SIZE), floor(SIZE))
    if (SIZE < 1000) {
      mr <- apply.sort(m)
      stopifnot(apply.sort.q(m) == mr)
      stopifnot(apply.sort.p(m) == mr)
      stopifnot(sort.all(m) == mr)
      stopifnot(sort.all.q(m) == mr)
    }

    b <- benchmark(apply.sort(m),
                   apply.sort.q(m),
                   apply.sort.p(m),
                   sort.all(m),
                   sort.all.q(m),
                   columns= c("test", "elapsed", "relative",
                              "user.self", "sys.self"),
                   replications=1,
                   order=NULL)
    b$SIZE <- SIZE
    b$TOP <- TOP
    b$test <- factor(x=b$test, levels=b$test)

    bb <- rbind(bb, b)
  }
}

ftable(xtabs(user.self ~ SIZE+test+TOP, bb))

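The results so far indicate that, for all but the biggest matrices, apply really hurts performance unless you are doing a "top n". For "small" matrices (< 1e6 elements), just sorting the whole thing without apply is competitive. For "huge" matrices, sorting the whole array becomes slower than apply. Using partial works best for "huge" matrices and costs only slightly more on "small" ones.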

Please feel free to add your own sorting routine :-)

                      TOP      10      31     100     316
SIZE  test                                               
31    apply.sort(m)         0.004   0.012   0.000   0.000
      apply.sort.q(m)       0.008   0.016   0.000   0.000
      apply.sort.p(m)       0.008   0.020   0.000   0.000
      sort.all(m)           0.000   0.008   0.000   0.000
      sort.all.q(m)         0.000   0.004   0.000   0.000
100   apply.sort(m)         0.012   0.016   0.028   0.000
      apply.sort.q(m)       0.016   0.016   0.036   0.000
      apply.sort.p(m)       0.020   0.020   0.040   0.000
      sort.all(m)           0.000   0.004   0.008   0.000
      sort.all.q(m)         0.004   0.004   0.004   0.000
316   apply.sort(m)         0.060   0.060   0.056   0.060
      apply.sort.q(m)       0.064   0.060   0.060   0.072
      apply.sort.p(m)       0.064   0.068   0.108   0.076
      sort.all(m)           0.016   0.016   0.020   0.024
      sort.all.q(m)         0.020   0.016   0.024   0.024
1000  apply.sort(m)         0.356   0.276   0.276   0.292
      apply.sort.q(m)       0.348   0.316   0.288   0.296
      apply.sort.p(m)       0.256   0.264   0.276   0.320
      sort.all(m)           0.268   0.244   0.213   0.244
      sort.all.q(m)         0.260   0.232   0.200   0.208
3162  apply.sort(m)         1.997   1.948   2.012   2.108
      apply.sort.q(m)       1.916   1.880   1.892   1.901
      apply.sort.p(m)       1.300   1.316   1.376   1.544
      sort.all(m)           2.424   2.452   2.432   2.480
      sort.all.q(m)         2.188   2.184   2.265   2.244
10000 apply.sort(m)        18.193  18.466  18.781  18.965
      apply.sort.q(m)      15.837  15.861  15.977  16.313
      apply.sort.p(m)       9.005   9.108   9.304   9.925
      sort.all(m)          26.030  25.710  25.722  26.686
      sort.all.q(m)        23.341  23.645  24.010  24.073
31622 apply.sort(m)       201.265 197.568 196.181 196.104
      apply.sort.q(m)     163.190 160.810 158.757 160.050
      apply.sort.p(m)      82.337  81.305  80.641  82.490
      sort.all(m)         296.239 288.810 289.303 288.954
      sort.all.q(m)       260.872 249.984 254.867 252.087
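
As a further routine that could be dropped into the framework above, here is a sketch of an order-based column sort. It was not part of the benchmark, so no timing is claimed for it, and the name sort.order is made up here. The idea is to order the cells by (column index, value) in a single call; unlike the offset trick, it needs no bound on the values and does no arithmetic on them:

```r
# Hypothetical extra candidate: order all cells by column index first,
# then by value, and refill the matrix column by column.
sort.order <- function(m) {
  matrix(m[order(col(m), m)], nrow = nrow(m))
}

set.seed(1)
m <- matrix(runif(50 * 20), nrow = 50)

# Same result as sorting each column separately
stopifnot(identical(sort.order(m), apply(m, 2, sort)))
```

To benchmark it alongside the others, it would need the same [1:TOP,] subsetting that the other candidates apply to their results.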

score 3 · Accepted Answer

Does

apply(m, 2, sort)

do the job? :)

Or for the top 10, for example, use:

apply(m, 2, function(x) sort(x, decreasing = TRUE)[1:10])

Performance is good: for 1e7 rows and 5 columns (5e7 numbers in total), it takes about 9 or 10 seconds on my computer.
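
If only the largest n values per column are needed, a partial sort can avoid ordering the whole column. The following is a sketch under stated assumptions, not a measured improvement: the helper name top_n is made up here, and it assumes n <= length(x) and no NA values. Because sort() rejects decreasing = TRUE together with partial, the sketch partially sorts in ascending order and takes the upper tail:

```r
# Largest n values of x, in decreasing order, via a partial sort.
top_n <- function(x, n) {
  k <- length(x) - n + 1
  # After the partial sort, positions k..length(x) hold the n largest
  # values, in no particular order.
  tail_vals <- sort(x, partial = k)[k:length(x)]
  sort(tail_vals, decreasing = TRUE)
}

set.seed(7)
m <- matrix(runif(2000 * 5), ncol = 5)

res      <- apply(m, 2, top_n, n = 10)
expected <- apply(m, 2, function(x) sort(x, decreasing = TRUE)[1:10])
stopifnot(identical(res, expected))
```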

score 3 · Accepted Answer

R is very fast at matrix calculations. A matrix with 1e7 elements in 1e4 columns is sorted in under 3 seconds on my machine:

set.seed(1)
m <- matrix(runif(1e7), ncol=1e4)

system.time(sm <- apply(m, 2, sort))
   user  system elapsed 
   2.62    0.14    2.79

The first 15 rows of the first 5 columns:

sm[1:15, 1:5]
              [,1]         [,2]         [,3]         [,4]         [,5]
 [1,] 2.607703e-05 0.0002085913 9.364448e-05 0.0001937598 1.157424e-05
 [2,] 9.228056e-05 0.0003156713 4.948019e-04 0.0002542199 2.126186e-04
 [3,] 1.607228e-04 0.0003988042 5.015987e-04 0.0004544661 5.855639e-04
 [4,] 5.756689e-04 0.0004399747 5.762535e-04 0.0004621083 5.877446e-04
 [5,] 6.932740e-04 0.0004676797 5.784736e-04 0.0004749235 6.470268e-04
 [6,] 7.856274e-04 0.0005927107 8.244428e-04 0.0005443178 6.498618e-04
 [7,] 8.489799e-04 0.0006210336 9.249109e-04 0.0005917936 6.548134e-04
 [8,] 1.001975e-03 0.0006522120 9.424880e-04 0.0007702231 6.569310e-04
 [9,] 1.042956e-03 0.0007237203 1.101990e-03 0.0009826915 6.810103e-04
[10,] 1.246256e-03 0.0007968422 1.117999e-03 0.0009873926 6.888523e-04
[11,] 1.337960e-03 0.0009294956 1.229132e-03 0.0009997757 8.671272e-04
[12,] 1.372295e-03 0.0012221676 1.329478e-03 0.0010375632 8.806398e-04
[13,] 1.583430e-03 0.0012781983 1.433513e-03 0.0010662393 8.886999e-04
[14,] 1.603961e-03 0.0013518191 1.458616e-03 0.0012068383 8.903167e-04
[15,] 1.673268e-03 0.0013697683 1.590524e-03 0.0013617468 1.024081e-03

score 1 · Accepted Answer

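They say there's a fine line between genius and madness... take a look at this and see what you think of the idea. As in the question, the goal is to find the top 30 elements of a vector vec that may be very long (1e7, 1e8 or more elements).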

topn = 30
sdmult = max(1,qnorm(1-(topn/length(vec))))
sdmin = 1e-5
acceptmult = 10
calcsd = max(sd(vec),sdmin)
calcmn = mean(vec)
thresh = calcmn + sdmult*calcsd
subs = which(vec > thresh)
while (length(subs) > topn * acceptmult) {
    thresh = thresh + calcsd
    subs = which(vec > thresh)
}
while (length(subs) < topn) {
    thresh = thresh - calcsd
    subs = which(vec > thresh)
}
topvals = sort(vec[subs],dec=TRUE)[1:topn]

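The basic idea is that even if we know very little about the distribution of vec, we would certainly expect the highest values in vec to lie several standard deviations above the mean. If vec were normally distributed, the qnorm expression on line 2 gives a rough idea of how many sds above the mean we need to look to find the highest topn values (e.g. if vec contains 1e8 values, the top 30 values are likely to lie in the region starting about 5 sds above the mean). Even if vec is not normal, this assumption is unlikely to be very far from the truth.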

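OK, so we compute the mean and sd of vec and use them to propose a threshold that sits a certain number of sds above the mean. We hope to find, in this upper tail, a subset containing slightly more than topn values. If we do, we can sort it and easily identify its highest topn values, which will also be the highest topn values of vec overall.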

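The exact rules here can probably be tweaked a bit, but we need to guard against the initial threshold being "way out" for some reason. So we exploit the fact that it is quick to check how many elements lie above a given threshold. We first raise the threshold, in increments of calcsd, until no more than 10 * topn elements lie above it. Then, if needed, we lower thresh (again in steps of calcsd) until we are sure there are at least topn elements above the threshold. This two-way search should always lead to a "threshold set" whose size is fairly close to topn (hopefully within a factor of 10 or 100). As topn is relatively small (a typical value is 30), sorting this threshold set will be very fast, and it immediately gives us the topn highest elements of the original vector vec.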

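My claim is that the calculations involved in producing a suitable threshold set are all fast in R, so if only the top 30 or so elements of a very large vector are needed, this indirect approach will beat any approach that involves sorting the whole vector.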
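
One way to check the correctness of the procedure on synthetic data is to wrap the steps above in a function and compare its result with a full sort. This is only a sketch: the function name top_n_threshold and the length guard are additions made here, it assumes vec has no NA values, and the rnorm test data is an arbitrary choice. It verifies the result, not the speed claim:

```r
# The thresholding steps from above, wrapped in a function.
top_n_threshold <- function(vec, topn = 30, acceptmult = 10, sdmin = 1e-5) {
  stopifnot(length(vec) >= topn)   # otherwise the second loop never ends
  sdmult <- max(1, qnorm(1 - topn / length(vec)))
  calcsd <- max(sd(vec), sdmin)
  thresh <- mean(vec) + sdmult * calcsd
  subs <- which(vec > thresh)
  # Too many candidates: raise the threshold one sd at a time
  while (length(subs) > topn * acceptmult) {
    thresh <- thresh + calcsd
    subs <- which(vec > thresh)
  }
  # Too few candidates: lower the threshold one sd at a time
  while (length(subs) < topn) {
    thresh <- thresh - calcsd
    subs <- which(vec > thresh)
  }
  sort(vec[subs], decreasing = TRUE)[seq_len(topn)]
}

set.seed(1)
vec <- rnorm(1e6)
full <- sort(vec, decreasing = TRUE)[1:30]
stopifnot(identical(top_n_threshold(vec), full))
```

Timing both lines with system.time() on vectors of 1e7 or 1e8 elements would show whether the approach pays off on a given machine and data set.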

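What do you think?! If you think this is an interesting idea, please like/upvote :) I'll look into doing some proper timings, but my initial tests on randomly generated data were really promising. It would be great to test on "real" data though...!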

Cheers :)
