r - R: `unlist` uses a lot of time when summing a subset of a matrix

Question

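I have a program that pulls data out of a MySQL database, decodes a pair of binary columns, and then sums a subset of the rows within that pair of binary columns. Running the program on a sample data set takes 12-14 seconds, with 9-10 of those seconds taken up by `unlist`. I'm wondering if there is any way to speed things up.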

Structure of the table

The rows I'm getting from the database look like this:

| array_length | mz_array        | intensity_array |
|--------------+-----------------+-----------------|
|           98 | 00c077e66340... | 002091c37240... |
|           74 | c04a7c7340...   | db87734000...   |

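where `array_length` is the number of little-endian doubles in the two arrays (they are guaranteed to be the same length). So the first row has 98 doubles in both `mz_array` and `intensity_array`. `array_length` has a mean of 825 and a median of 620, and there are 13,000 rows.

Decoding the binary arrays

Each row is decoded by passing it to the following function. Once the binary arrays have been decoded, `array_length` is no longer needed.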

DecodeSpectrum <- function(array_length, mz_array, intensity_array) {
  sapply(list(mz_array=mz_array, intensity_array=intensity_array),
         readBin,
         what="double",
         endian="little",
         n=array_length)
}
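To make the decoding step concrete, here is a minimal sketch that fakes one database row with `writeBin` and decodes it again. The toy values and the variable names `mz_raw` and `intensity_raw` are made up for illustration; the real rows come from the database as raw vectors.

```r
# Build a fake row: two raw vectors holding little-endian doubles,
# standing in for the binary columns returned by the database.
mz        <- c(100.5, 200.25, 300.125)
intensity <- c(10, 20, 30)
mz_raw        <- writeBin(mz, raw(), endian = "little")
intensity_raw <- writeBin(intensity, raw(), endian = "little")

# The same readBin call that DecodeSpectrum applies to each column.
decoded_mz <- readBin(mz_raw, what = "double", endian = "little",
                      n = length(mz))

# The sapply-based decoder from above, repeated so this block runs standalone.
DecodeSpectrum <- function(array_length, mz_array, intensity_array) {
  sapply(list(mz_array = mz_array, intensity_array = intensity_array),
         readBin,
         what = "double",
         endian = "little",
         n = array_length)
}

spectrum <- DecodeSpectrum(length(mz), mz_raw, intensity_raw)
print(dim(spectrum))       # one row per double, one column per array
print(colnames(spectrum))  # column names come from the list names
```

Each double occupies 8 bytes, so a row with `array_length` of 98 carries 784 bytes in each binary column.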

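Summing the arrays

The next step is to sum the values in `intensity_array`, but only those whose corresponding entry in `mz_array` falls within a certain window. The arrays are ordered by `mz_array`, ascending. I am using the following function to sum up the `intensity_array` values: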

SumInWindow <- function(spectrum, lower, upper) {
  sum(spectrum[spectrum[,1] > lower & spectrum[,1] < upper, 2])
}

where `spectrum` is the output of `DecodeSpectrum`, a matrix.
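A small worked example of the windowed sum, using a hand-built matrix instead of decoded data (the numbers are made up):

```r
# SumInWindow as defined above, repeated so this block runs standalone.
SumInWindow <- function(spectrum, lower, upper) {
  sum(spectrum[spectrum[, 1] > lower & spectrum[, 1] < upper, 2])
}

# Column 1 plays the role of mz_array, column 2 of intensity_array.
spectrum <- cbind(mz_array        = c(500, 650, 900, 1250),
                  intensity_array = c(1, 2, 4, 8))

total <- SumInWindow(spectrum, 600, 1200)
print(total)  # 6: only the rows with mz 650 and 900 lie strictly inside the window

# A window that matches no rows sums an empty vector, which gives 0.
empty <- SumInWindow(spectrum, 2000, 3000)
print(empty)  # 0
```

Both bounds are exclusive, so a row whose mz value equals `lower` or `upper` is not counted.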

Operating on the list of rows

Each batch of rows is handled by:

ProcessSegment <- function(spectra, window_bounds) {
  lower <- window_bounds[1]
  upper <- window_bounds[2]
  ## Decode a single spectrum and sum the intensities within the window.
  SumDecode <- function (...) {
    SumInWindow(DecodeSpectrum(...), lower, upper)
  }

  do.call("mapply", c(SumDecode, spectra))
}
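The `do.call("mapply", c(SumDecode, spectra))` idiom may look opaque at first. Here is a sketch of what it expands to, with a hypothetical `columns` list standing in for `res$data` and a made-up function `AddThree` standing in for `SumDecode`:

```r
# Hypothetical stand-in for res$data: a list of equal-length columns.
columns <- list(a = c(1, 2, 3), b = c(10, 20, 30), c = c(100, 200, 300))

# Stand-in for SumDecode: receives one element of each column per call.
AddThree <- function(a, b, c) a + b + c

# c(AddThree, columns) builds list(AddThree, a = ..., b = ..., c = ...);
# do.call then evaluates mapply(AddThree, a = ..., b = ..., c = ...),
# i.e. AddThree is called once per row.
row_sums <- do.call("mapply", c(AddThree, columns))
print(row_sums)  # 111 222 333

# The same call written out by hand.
same <- mapply(AddThree, columns$a, columns$b, columns$c)
```

In `ProcessSegment` the columns are `array_length`, `mz_array` and `intensity_array`, so `SumDecode` gets called once per database row.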

Finally, the rows are fetched and passed to `ProcessSegment` by this function:

ProcessAllSegments <- function(conn, window_bounds) {
  nextSeg <- function() odbcFetchRows(conn, max=batchSize, buffsize=batchSize)

  while ((res <- nextSeg())$stat == 1 && res$data[[1]] > 0) {
    print(ProcessSegment(res$data, window_bounds))
  }
}

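I'm doing the fetches in segments so that R doesn't have to load the entire data set into memory at once (doing so was causing out-of-memory errors). I'm using the RODBC driver because the RMySQL driver isn't able to return unsullied binary values (as far as I can tell).

Performance

For a sample data set of about 140MiB, the whole process takes around 14 seconds to complete, which is not that bad for 13,000 rows. Still, I think there's room for improvement, especially when looking at the `Rprof` output: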

$by.self
                 self.time self.pct total.time total.pct
"unlist"             10.26    69.99      10.30     70.26
"SumInWindow"         1.06     7.23      13.92     94.95
"mapply"              0.48     3.27      14.44     98.50
"as.vector"           0.44     3.00      10.60     72.31
"array"               0.40     2.73       0.40      2.73
"FUN"                 0.40     2.73       0.40      2.73
"list"                0.30     2.05       0.30      2.05
"<"                   0.22     1.50       0.22      1.50
"unique"              0.18     1.23       0.36      2.46
">"                   0.18     1.23       0.18      1.23
".Call"               0.16     1.09       0.16      1.09
"lapply"              0.14     0.95       0.86      5.87
"simplify2array"      0.10     0.68      11.48     78.31
"&"                   0.10     0.68       0.10      0.68
"sapply"              0.06     0.41      12.36     84.31
"c"                   0.06     0.41       0.06      0.41
"is.factor"           0.04     0.27       0.04      0.27
"match.fun"           0.04     0.27       0.04      0.27
"<Anonymous>"         0.02     0.14      13.94     95.09
"unique.default"      0.02     0.14       0.06      0.41

$by.total
                     total.time total.pct self.time self.pct
"ProcessAllSegments"      14.66    100.00      0.00     0.00
"do.call"                 14.50     98.91      0.00     0.00
"ProcessSegment"          14.50     98.91      0.00     0.00
"mapply"                  14.44     98.50      0.48     3.27
"<Anonymous>"             13.94     95.09      0.02     0.14
"SumInWindow"             13.92     94.95      1.06     7.23
"sapply"                  12.36     84.31      0.06     0.41
"DecodeSpectrum"          12.36     84.31      0.00     0.00
"simplify2array"          11.48     78.31      0.10     0.68
"as.vector"               10.60     72.31      0.44     3.00
"unlist"                  10.30     70.26     10.26    69.99
"lapply"                   0.86      5.87      0.14     0.95
"array"                    0.40      2.73      0.40     2.73
"FUN"                      0.40      2.73      0.40     2.73
"unique"                   0.36      2.46      0.18     1.23
"list"                     0.30      2.05      0.30     2.05
"<"                        0.22      1.50      0.22     1.50
">"                        0.18      1.23      0.18     1.23
".Call"                    0.16      1.09      0.16     1.09
"nextSeg"                  0.16      1.09      0.00     0.00
"odbcFetchRows"            0.16      1.09      0.00     0.00
"&"                        0.10      0.68      0.10     0.68
"c"                        0.06      0.41      0.06     0.41
"unique.default"           0.06      0.41      0.02     0.14
"is.factor"                0.04      0.27      0.04     0.27
"match.fun"                0.04      0.27      0.04     0.27

$sample.interval
[1] 0.02

$sampling.time
[1] 14.66

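I'm surprised to see `unlist` taking up so much time; this suggests to me that there might be some redundant copying or rearranging going on. I'm new to R, so it's entirely possible that this is normal, but I'd like to know if there's anything glaringly wrong.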
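For anyone who wants to produce a profile like the one above, this is roughly how it can be collected. The `Workload` function is a hypothetical stand-in for `ProcessAllSegments(conn, window_bounds)`, since the real call needs a database connection:

```r
# Hypothetical CPU-bound workload; it runs for about one second so that
# the sampling profiler has something to record.
Workload <- function(seconds = 1) {
  start <- proc.time()[["elapsed"]]
  while (proc.time()[["elapsed"]] - start < seconds) {
    sapply(list(a = runif(5000), b = runif(5000)), sort)
  }
  invisible(NULL)
}

profile_file <- tempfile(fileext = ".out")
Rprof(profile_file, interval = 0.02)  # start sampling every 20 ms
Workload()
Rprof(NULL)                           # stop sampling

# summaryRprof returns the $by.self / $by.total tables shown above.
profile <- summaryRprof(profile_file)
print(names(profile))
print(head(profile$by.self))
```

In `$by.self`, the `self.time` column counts only time spent in the function itself, while `total.time` also includes everything it calls. That is why `unlist` tops the first table and `ProcessAllSegments` tops the second.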

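Update: sample data posted

I've posted the full version of the program here, and the sample data I use here. The sample data is the `gzip`ed output of `mysqldump`. You need to set the proper environment variables for the script to connect to the database: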

MZDB_HOST
MZDB_DB
MZDB_USER
MZDB_PW

To run the script, you must specify the `run_id` and the window boundaries. I run the program like this:

Rscript ChromatoGen.R -i 1 -m 600 -M 1200

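These window bounds are pretty arbitrary, but they select roughly a half to a third of the range. If you want to print the results, put a `print()` around the call to `ProcessSegment` within `ProcessAllSegments`. Using those parameters, the first 5 should be: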

[1] 7139.682 4522.314 3435.512 5255.024 5947.999

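You probably want to limit the number of results, unless you want 13,000 numbers filling your screen :) The simplest way is to add `LIMIT 5` at the end of `query`.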

score 0 · Accepted Answer

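I've figured it out!

The problem was in the `sapply()` call. `sapply` does a fair amount of renaming and attribute setting, which slows things down massively for arrays of this size. Replacing `DecodeSpectrum` with the following code brought the sampling time from 14.66 seconds down to 3.36 seconds, roughly a 4-fold speedup!

Here's the new body of `DecodeSpectrum`: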

DecodeSpectrum <- function(array_length, mz_array, intensity_array) {
  ## needed to tell `vapply` how long the result should be. No, there isn't an
  ## easier way to do this.
  resultLength <- rep(1.0, array_length)

  vapply(list(mz_array=mz_array, intensity_array=intensity_array),
         readBin,
         resultLength,
         what="double",
         endian="little",
         n=array_length,
         USE.NAMES=FALSE)
}
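As a sanity check, the two decoders can be compared on synthetic data. The sketch below uses made-up names (`DecodeSapply`, `DecodeVapply`) for the old and new versions and random values instead of the real spectra. It also uses `numeric(array_length)` as the `vapply` template, which should serve the same purpose as `rep(1.0, array_length)` since only the type and length of the template matter:

```r
# Old version: sapply simplifies the list result after the fact.
DecodeSapply <- function(array_length, mz_array, intensity_array) {
  sapply(list(mz_array = mz_array, intensity_array = intensity_array),
         readBin, what = "double", endian = "little", n = array_length)
}

# New version: vapply is told the type and length of each result up front.
DecodeVapply <- function(array_length, mz_array, intensity_array) {
  vapply(list(mz_array = mz_array, intensity_array = intensity_array),
         readBin,
         numeric(array_length),  # template: array_length doubles per column
         what = "double", endian = "little", n = array_length,
         USE.NAMES = FALSE)
}

# Synthetic spectrum, encoded the way the database columns are.
set.seed(1)
n <- 1000
mz_raw        <- writeBin(sort(runif(n, 100, 2000)), raw(), endian = "little")
intensity_raw <- writeBin(runif(n), raw(), endian = "little")

old <- DecodeSapply(n, mz_raw, intensity_raw)
new <- DecodeVapply(n, mz_raw, intensity_raw)

# Same shape and values; only the dimnames may differ.
print(dim(new))
print(isTRUE(all.equal(unname(old), unname(new))))
```

`SumInWindow` indexes columns by position, so the missing column names do not affect it.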

The `Rprof` output now looks like:

$by.self
               self.time self.pct total.time total.pct
"<Anonymous>"           0.64    19.75       2.14     66.05
"DecodeSpectrum"        0.46    14.20       1.12     34.57
".Call"                 0.42    12.96       0.42     12.96
"FUN"                   0.38    11.73       0.38     11.73
"&"                     0.16     4.94       0.16      4.94
">"                     0.14     4.32       0.14      4.32
"c"                     0.14     4.32       0.14      4.32
"list"                  0.14     4.32       0.14      4.32
"vapply"                0.12     3.70       0.66     20.37
"mapply"                0.10     3.09       2.54     78.40
"simplify2array"        0.10     3.09       0.30      9.26
"<"                     0.08     2.47       0.08      2.47
"t"                     0.04     1.23       2.72     83.95
"as.vector"             0.04     1.23       0.08      2.47
"unlist"                0.04     1.23       0.08      2.47
"lapply"                0.04     1.23       0.04      1.23
"unique.default"        0.04     1.23       0.04      1.23
"NextSegment"           0.02     0.62       0.50     15.43
"odbcFetchRows"         0.02     0.62       0.46     14.20
"unique"                0.02     0.62       0.10      3.09
"array"                 0.02     0.62       0.04      1.23
"attr"                  0.02     0.62       0.02      0.62
"match.fun"             0.02     0.62       0.02      0.62
"odbcValidChannel"      0.02     0.62       0.02      0.62
"parent.frame"          0.02     0.62       0.02      0.62

$by.total
                     total.time total.pct self.time self.pct
"ProcessAllSegments"       3.24    100.00      0.00     0.00
"t"                        2.72     83.95      0.04     1.23
"do.call"                  2.68     82.72      0.00     0.00
"mapply"                   2.54     78.40      0.10     3.09
"<Anonymous>"              2.14     66.05      0.64    19.75
"DecodeSpectrum"           1.12     34.57      0.46    14.20
"vapply"                   0.66     20.37      0.12     3.70
"NextSegment"              0.50     15.43      0.02     0.62
"odbcFetchRows"            0.46     14.20      0.02     0.62
".Call"                    0.42     12.96      0.42    12.96
"FUN"                      0.38     11.73      0.38    11.73
"simplify2array"           0.30      9.26      0.10     3.09
"&"                        0.16      4.94      0.16     4.94
">"                        0.14      4.32      0.14     4.32
"c"                        0.14      4.32      0.14     4.32
"list"                     0.14      4.32      0.14     4.32
"unique"                   0.10      3.09      0.02     0.62
"<"                        0.08      2.47      0.08     2.47
"as.vector"                0.08      2.47      0.04     1.23
"unlist"                   0.08      2.47      0.04     1.23
"lapply"                   0.04      1.23      0.04     1.23
"unique.default"           0.04      1.23      0.04     1.23
"array"                    0.04      1.23      0.02     0.62
"attr"                     0.02      0.62      0.02     0.62
"match.fun"                0.02      0.62      0.02     0.62
"odbcValidChannel"         0.02      0.62      0.02     0.62
"parent.frame"             0.02      0.62      0.02     0.62

$sample.interval
[1] 0.02

$sampling.time
[1] 3.24

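It's possible that some additional performance could be squeezed out by reworking the `do.call('mapply', ...)` call, but I'm satisfied enough with the performance as it is that I'm not willing to spend more time on it.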
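One untested direction, offered only as a sketch: because just two columns are decoded, the intermediate matrix can be skipped by calling `readBin` twice and summing directly. The helper name `SumDecodeDirect` is made up, and whether it actually beats the `vapply` version on the real data would have to be measured with `Rprof`.

```r
# Hypothetical helper: decode both columns with plain readBin calls and sum
# the intensities whose mz value lies strictly inside (lower, upper).
SumDecodeDirect <- function(array_length, mz_array, intensity_array,
                            lower, upper) {
  mz <- readBin(mz_array, what = "double", endian = "little",
                n = array_length)
  intensity <- readBin(intensity_array, what = "double", endian = "little",
                       n = array_length)
  sum(intensity[mz > lower & mz < upper])
}

# Toy row encoded like the database columns (values are made up).
mz        <- c(500, 650, 900, 1250)
intensity <- c(1, 2, 4, 8)
result <- SumDecodeDirect(length(mz),
                          writeBin(mz, raw(), endian = "little"),
                          writeBin(intensity, raw(), endian = "little"),
                          lower = 600, upper = 1200)
print(result)  # 6, the intensities at mz 650 and 900
```

Inside `ProcessSegment` this could be passed to `mapply` with `MoreArgs = list(lower = lower, upper = upper)` in place of the `SumDecode` closure.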
