performance - Optimising math operations on arrays of floats in Ada 95 with GNAT

Question

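Consider the code below. It is supposed to process data at a fixed rate, in one-second batches. It is part of a larger system and must not take up too much time.

When run over 100 batches of one second of data, the program takes 35 seconds (or 35%), calling this function in a loop. The test loop is timed specifically with Ada.Real_Time. The data is pregenerated, so the bulk of the execution time is definitely in this loop.

How can I improve the code to get the processing time down to a minimum?

The code will run on an Intel Pentium-M, which is a P3 with SSE2.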

package FF is new Ada.Numerics.Generic_Elementary_Functions(Float);

N : constant Integer := 820;
type A is array(1 .. N) of Float;
type A3 is array(1 .. 3) of A;

procedure F(state  : in out A3;
            result :    out A3;
            l      : in     A;
            r      : in     A) is
   s : Float;
   t : Float;
begin
   for i in 1 .. N loop
      t := l(i) + r(i);
      t := t / 2.0;
      state(1)(i) := t;
      state(2)(i) := t * 0.25 + state(2)(i) * 0.75;
      state(3)(i) := t * 1.0 /64.0 + state(2)(i) * 63.0 /64.0;
      for r in 1 .. 3 loop
         s := state(r)(i);
         t := FF."**"(s, 6.0) + 14.0;
         if t > MAX then
            t := MAX;
         elsif t < MIN then
            t := MIN;
         end if;
         result(r)(i) := FF.Log(t, 2.0);
      end loop;
   end loop;
end;

Pseudocode used for testing

create two arrays of 80 random A arrays, called ls and rs;
init the state and result A3 array
record the realtime time now, called last
for i in 1 .. 100 loop
   for j in 1 .. 80 loop
      F(state, result, ls(j), rs(j));
   end loop;
end loop;
record the realtime time now, called curr
output the duration between curr and last
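
The timing part of the pseudocode above can be sketched in Ada as follows. This is only an illustration under stated assumptions: `Time_Batches`, `Work` and `Sink` are hypothetical names, and `Work` is a stand-in for the real call to F with its pregenerated data. Only `Clock`, the `Time` subtraction and `To_Duration` come from Ada.Real_Time.

```ada
with Ada.Text_IO;
with Ada.Real_Time; use Ada.Real_Time;

procedure Time_Batches is
   Last, Curr : Time;

   --  Accumulator that keeps the stand-in workload observable.
   Sink : Float := 0.0;

   --  Hypothetical stand-in for F (state, result, ls (j), rs (j)).
   procedure Work is
   begin
      for K in 1 .. 820 loop
         Sink := Sink + Float (K) * 0.5;
      end loop;
   end Work;
begin
   Last := Clock;
   for I in 1 .. 100 loop
      for J in 1 .. 80 loop
         Work;
      end loop;
   end loop;
   Curr := Clock;

   --  Curr - Last is a Time_Span; To_Duration converts it to seconds.
   Ada.Text_IO.Put_Line
     ("Elapsed:" & Duration'Image (To_Duration (Curr - Last)) & " s");

   --  Print the accumulated value so the work cannot be optimized away.
   Ada.Text_IO.Put_Line ("Sink:" & Float'Image (Sink));
end Time_Batches;
```

Printing a value that depends on the work matters once optimization is enabled, because a loop whose results are never used may otherwise be removed entirely.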

score 1 · Accepted Answer

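I'd second Marc C here (he generally knows his stuff). I've used gprof with GNAT before. It can be tricky to set up, but it works like a champ. If you like, you can use it to get the percentage of runtime spent on each line of the code above.

I could make some suggestions (such as precalculating 63.0/64.0), but a good optimizer ought to be doing most of that already. You need to figure out what it is not doing on the lines that are particularly CPU-hungry, and speed those up.

Looking at the code, I'd guess the profiler will tell you that the exponentiation and log operations take most of the time. If you can find a way to precompute some of that, it might help. I'm getting ahead of myself, though. Profile!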

score 1 · Accepted Answer

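First let me try to correct my answer:

In my answer it should be FF."**"(s, 6.0) and s ** 6 (not FF."*"(s, 6.0) and s * 6). [This is weird.. the editor keeps trying to strip the * from my text.]

I just checked the source code by Marc C that gad pointed to, and it really is ** 6!

I'll just add that I would expect doing s**6 yourself to be an improvement, using s2 := s*s and s_to_the_6 := s2 * s2 * s2; -- Jonathan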
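
The squaring trick suggested above can be sketched as a small function. `Pow6` and `Pow6_Demo` are hypothetical names used only for this illustration. The idea is that s**6 = (s*s)**3, which takes three multiplications instead of the five needed by a plain chain of products.

```ada
with Ada.Text_IO;

procedure Pow6_Demo is

   --  Hypothetical helper: S ** 6 using three multiplications.
   function Pow6 (S : Float) return Float is
      S2 : constant Float := S * S;   --  S ** 2
   begin
      return S2 * S2 * S2;            --  (S ** 2) ** 3 = S ** 6
   end Pow6;

begin
   --  Compare against the built-in integer-exponent operator.
   Ada.Text_IO.Put_Line ("Pow6 (2.0) =" & Float'Image (Pow6 (2.0)));
   Ada.Text_IO.Put_Line ("2.0 ** 6   =" & Float'Image (2.0 ** 6));
end Pow6_Demo;
```

Whether this beats what the compiler already generates for `s ** 6` at a given optimization level is something only a measurement can settle.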

score 1 · Accepted Answer

Replacing t := FF."**"(s, 6.0) + 14.0; with t := s ** 6 + 14.0; may be faster, since it avoids log's and exp's. -- Jonathan

score 0 · Accepted Answer

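The following code improvements brought the runtime down to 8 seconds.

Then, adding the command-line options -O3, -mtune=pentium-m and -msse2 brought the runtime down to 0.8 seconds.

My suspicions are:

Log(x, y) is implemented as Log(x) / Log(y), which is wasteful when y is constant, because Log(y) is recomputed on every call. (7 seconds)
Power(x, y) is implemented as a loop; the conditions and jumps are expensive. (20 seconds)

The "**" routine might be something like...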

r := x; for i in 2 .. y loop r := r * x; end loop; return r;

Modified function

package FF is new Ada.Numerics.Generic_Elementary_Functions(Float); 

N : constant Integer := 820; 
type A is array(1 .. N) of Float; 
type A3 is array(1 .. 3) of A; 

procedure F(state  : in out A3; 
            result :    out A3; 
            l      : in     A; 
            r      : in     A) is 
   -- Keep the reciprocal of Log(2.0) so it is not recalculated
   ONE_ON_LOG_TWO : constant Float := 1.0 / FF.Log(2.0);
   M1 : constant Float := 1.0 / 64.0;
   M2 : constant Float := 63.0 / 64.0;
   s : Float; 
   t : Float; 
begin 
   for i in 1 .. N loop 
      t := l(i) + r(i); 
      -- Multiply Not Divide
      t := t * 0.5; 
      state(1)(i) := t; 
      state(2)(i) := t * 0.25 + state(2)(i) * 0.75; 
      state(3)(i) := t * M1 + state(2)(i) * M2; 
      for r in 1 .. 3 loop 
         s := state(r)(i);
         -- Since we know the power, hard code the multiply.
         t := s * s * s * s * s * s + 14.0; 
         if t > MAX then 
            t := MAX; 
         elsif t < MIN then 
            t := MIN; 
         end if; 
         -- Do not use Log(x,y) in a loop when y is constant.
         result(r)(i) := FF.Log(t) * ONE_ON_LOG_TWO; 
      end loop; 
   end loop; 
end;
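
Before trusting the timings, it may be worth confirming that the two rewrites give the same values as the original calls, up to rounding. The sketch below is an illustration only: `Check_Rewrites` and the sample value 1.5 are assumptions, not part of the original answer. It relies on the identity log2(t) = ln(t) / ln(2).

```ada
with Ada.Text_IO;
with Ada.Numerics.Generic_Elementary_Functions;

procedure Check_Rewrites is
   package FF is new Ada.Numerics.Generic_Elementary_Functions (Float);

   --  Reciprocal of ln(2), computed once.
   One_On_Log_Two : constant Float := 1.0 / FF.Log (2.0);

   --  Arbitrary sample input, chosen only for the demonstration.
   S : constant Float := 1.5;

   --  Original and rewritten forms of the sixth power.
   Old_Pow : constant Float := FF."**" (S, 6.0);
   New_Pow : constant Float := S * S * S * S * S * S;

   --  Original and rewritten forms of the base-2 logarithm.
   T       : constant Float := New_Pow + 14.0;
   Old_Log : constant Float := FF.Log (T, 2.0);
   New_Log : constant Float := FF.Log (T) * One_On_Log_Two;
begin
   Ada.Text_IO.Put_Line
     ("power:" & Float'Image (Old_Pow) & " vs" & Float'Image (New_Pow));
   Ada.Text_IO.Put_Line
     ("log2: " & Float'Image (Old_Log) & " vs" & Float'Image (New_Log));
end Check_Rewrites;
```

The pairs should agree to within a few units in the last printed digit. Exact bit-for-bit equality is not expected, since the operations are rounded differently.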

score 0 · Accepted Answer

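You might want to look into making

type A3 is array (1 .. N, 1 .. 3) of Float;

That way, each operation in the outer loop accesses adjacent memory locations, and you should get better support from the cache: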

  state(i)(1) := t; 
  state(i)(2) := t * 0.25 + state(i)(2) * 0.75; 
  state(i)(3) := t * M1 + state(i)(2) * M2;

Using a renaming such as

  cur_state : array (1..3) of Float renames state(i);

followed by

  cur_state := (t, t * 0.25 + cur_state(2) * 0.75, t * M1 + cur_state(2) * M2);

may give the compiler some optimization hints.
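
A compilable variant of this layout idea might look like the sketch below. It makes several assumptions that are not in the original answer. The inner array gets a name (`Triple`), because an object renaming needs a subtype mark. A3 is declared as an array of triples so that `State (I)` is a valid name to rename. The new second component is computed first (`New_2`), because inside the aggregate `Cur (2)` still refers to the old value, whereas the original loop used the updated one for the third component. `Layout_Demo`, `Update` and the input values are hypothetical.

```ada
with Ada.Text_IO;

procedure Layout_Demo is
   N : constant Integer := 820;

   type Triple is array (1 .. 3) of Float;
   type A is array (1 .. N) of Float;
   --  The three state values for one index are adjacent in memory.
   type A3 is array (1 .. N) of Triple;

   M1 : constant Float := 1.0 / 64.0;
   M2 : constant Float := 63.0 / 64.0;

   procedure Update (State : in out A3; L, R : in A) is
      T : Float;
   begin
      for I in 1 .. N loop
         T := (L (I) + R (I)) * 0.5;
         declare
            Cur   : Triple renames State (I);
            --  Updated second component, needed by the third one.
            New_2 : constant Float := T * 0.25 + Cur (2) * 0.75;
         begin
            Cur := (T, New_2, T * M1 + New_2 * M2);
         end;
      end loop;
   end Update;

   State : A3 := (others => (others => 0.0));
   L     : constant A := (others => 1.0);
   R     : constant A := (others => 3.0);
begin
   Update (State, L, R);
   for K in 1 .. 3 loop
      Ada.Text_IO.Put_Line (Float'Image (State (1) (K)));
   end loop;
end Layout_Demo;
```

Whether this layout actually helps depends on the rest of the loop. The result array would need the same layout to keep the inner loop's accesses adjacent as well, and only a measurement on the target machine can confirm a gain.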
