performance - Perl 快速矩阵乘法

Question

我在 perl http://en.wikipedia.org/wiki/Fisher_information中实现了以下统计计算。
结果是正确的。我知道这一点是因为我有 100 个匹配输入和输出的测试用例。问题是我每次运行脚本时都需要计算多次。对该函数的平均调用次数约为 530。我使用 Devel::NYTProf 找出了这一点以及慢速部分的位置。我已经优化了算法，只遍历矩阵的上半部分并将其反射到底部，因为它们是相同的。我不是 perl 专家，但我需要知道是否有什么可以尝试加快 perl 的速度。该脚本分发给客户端，因此编译 C 文件不是一种选择。我可以尝试另一个 perl 库吗？如果可能的话，这需要亚秒级的速度。

更多信息是 $MatrixRef 是一个浮点数矩阵，它是 $rows by $variables。这是该函数的 NYTProf 转储。

#-----------------------------------------------
#
#-----------------------------------------------
sub ComputeXpX
# spent 4.27s within ComputeXpX which was called 526 times, avg 8.13ms/call:
# 526 times (4.27s+0s) by ComputeEfficiency at line 7121, avg 8.13ms/call
{
526 0s                  my ($MatrixRef, $rows, $variables) = @_;

526 0s                  my $r = 0;
526 0s                  my $c = 0;
526 0s                  my $k = 0;
526 0s                  my $sum = 0;
526 0s                  my @xpx = ();

526 11.0ms              for ($r = 0; $r < $variables; $r++)
                        {
14202   19.0ms                my @temp = (0) x $variables;
14202   6.01ms                push(@xpx, \@temp);
526 0s                  }
526 7.01ms              for ($r = 0; $r < $variables; $r++)
                        {
14202   144ms                   for ($c = $r; $c < $variables; $c++)
                                {
198828  43.0ms                                  $sum = 0;
                                        #for ($k = 0; $k < $rows; $k++)
198828  101ms                           foreach my $RowRef (@{$MatrixRef})
                                        {
                                                #$sum += $MatrixRef->[$k]->[$r]*$MatrixRef->[$k]->[$c];
6362496 3.77s                                   $sum += $RowRef->[$r]*$RowRef->[$c];
                                        }

198828  80.1ms                                  $xpx[$r]->[$c] = $sum;
                                                #reflect on other side of matrix
198828  82.1ms                                  $xpx[$c]->[$r] = $sum if ($r != $c);
14202   1.00ms                          }
526 2.00ms                  }

526 2.00ms                  return \@xpx;
}

score 2 · Accepted Answer

由于结果矩阵的每个元素都可以独立计算，因此应该可以并行计算部分/全部。换句话说，最内层循环的任何实例都不依赖于任何其他实例的结果，因此它们可以同时在自己的线程上运行。

score 2 · Accepted Answer

在这里你真的无能为力，除非用 C 重写部分，或者迁移到比准 Perl 更好的数学运算框架（→ PDL！）。

一些小的优化思路：

您@xpx使用包含零的 arrayrefs 进行初始化。这是不必要的，因为无论哪种方式，您都可以为每个位置分配一个值。如果要预先分配数组空间，请分配给$#array值：
```
my @array;
$#array = 100; # preallocate space for 101 scalars
```
这通常没有用，但您可以在有和没有的情况下进行基准测试。
迭代范围；不要使用 C 风格的for循环：
```
for my $c ($r .. $variables - 1) { ... }
```
Perl 标量对于数学运算不是很快，因此将范围迭代卸载到较低级别将获得加速。
尝试更改循环的顺序，并尝试缓存一定级别的数组访问。保持$my $xpx_r = $xpx[$r]在标量中将减少数组访问的数量。如果您的输入足够大，这将转化为速度增益。请注意，这仅在缓存值是引用时才有效。

请记住 perl 很少做“大”优化，并且编译生成的操作码树与您的源代码非常相似。

编辑：关于线程

Perl 线程是真正克隆当前解释器的重量级野兽。这非常像分叉。

跨线程边界共享数据结构是可能的（use threads::shared; my $variable :shared = "foo"），但存在各种缺陷。在Thread::Queue.

将一个产品的计算拆分到多个线程可能最终导致您的线程进行的通信多于计算。您可以对在线程之间划分某些行的责任的解决方案进行基准测试。但我认为在这里有效地重新组合解决方案会很困难。

更有可能有用的是从一开始就运行一堆工作线程。所有线程都侦听一个队列，该队列包含一对矩阵和一个返回队列。然后工作人员将问题出列，并将解决方案发回。多个计算可以并行运行，但单个矩阵乘法会更慢。您的其他代码必须进行大量重构才能利用并行性。

未经测试的代码：

use strict; use warnings; use threads; use Thread::Queue;

# spawn worker threads:
my $problem_queue = Thread::Queue->new;
my @threads = map threads->new(\&worker, $problem_queue), 1..3; # make 3 workers

# automatically close threads when program exits
END {
  $problem_queue->enqueue((undef) x @threads);
  $_->join for @threads;
}

# This is the wrapper around the threading,
# and can be called exactly as ComputeXpX
sub async_XpX {
  my $return_queue = Thread::Queue->new();
  $problem_queue->enqueue([$return_queue, @_]);
  return sub { $return_queue->dequeue };
}

# The main loop of worker threads
sub worker {
  my ($queue) = @_;
  while(defined(my $problem = $queue->dequeue)) {
     my ($return, @args) = @$problem;
     $return->enqueue(ComputeXpX(@args));
  }
}

sub ComputeXpX { ... } # as before

async_XpX返回一个 coderef，最终将收集计算结果。这使我们可以继续处理其他事情，直到我们需要结果为止。

# start two calculations
my $future1 = async_XpX(...);
my $future2 = async_XpX(...);

...; # do something else

# collect the results
my $result1 = $future1->();
my $result2 = $future2->();

我在没有进行实际计算的情况下对准系统线程代码进行了基准测试，并且通信与计算一样昂贵。也就是说，如果运气好的话，您可能会开始在一台至少有四个处理器/内核线程的机器上受益。

关于分析线程代码的注释：我不知道如何优雅地做到这一点。对线程代码进行基准测试，但使用单线程测试用例进行分析可能更可取。

performance - Perl 快速矩阵乘法

2 回答 2

编辑：关于线程

Related

Reference