delphi - How to rotate a bitmap efficiently in code

Question

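Is there a faster way to rotate a large bitmap by 90 or 270 degrees than simply doing a nested loop with inverted coordinates?

The bitmaps are 8bpp and typically 2048x2400x8bpp.

Currently I do this by simply copying with the indices swapped, roughly (pseudocode):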

for x = 0 to 2048-1
  for y = 0 to 2048-1
    dest[x][y]=src[y][x];

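(In reality I do it with pointers, which is a bit faster, but the timing is of roughly the same magnitude.)

GDI is quite slow with large images, and the GPU load/store times for textures (GF7 cards) are of the same magnitude as the current CPU time.

Any tips or pointers? An in-place algorithm would be even better, but speed is more important than being in-place.

The target is Delphi, but this is more of an algorithmic question. SSE(2) vectorization is no problem; the problem is big enough for me to code it in assembler.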

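Follow-up to Nils's answer

Image: 2048x2700 -> 2700x2048
Compiler: Turbo Explorer 2006 with optimization on.
Windows: power scheme set to "Always On". (Important!!!!)
Machine: Core2 6600 (2.4 GHz)

Time with the old routine: 32 ms (step size 1)

Time with step size 8: 12 ms

Time with step size 16: 10 ms

Time with step size 32+: 9 ms

Meanwhile I also tested on an Athlon 64 X2 (5200+ iirc), and the speedup there was slightly more than a factor of four (80 to 19 ms).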

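The speedup is well worth it, thanks. Maybe during the summer months I'll torture myself with an SSE(2) version. However, I have already thought about how to tackle that, and I think I'll run out of SSE2 registers for a straightforward implementation: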

for n:=0 to 7 do
  begin
    load r0, <source+n*rowsize> 
    shift byte from r0 into r1
    shift byte from r0 into r2
    ..
    shift byte from r0 into r8
  end; 
store r1, <target>
store r2, <target+1*rowsize>
..
store r8, <target+7*rowsize>

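So 8x8 needs 9 registers, but 32-bit SSE only has 8. Anyway, that is something for the summer months :-)

Note that the pointer thing is something I do out of instinct, but there may actually be something to it: if your dimensions are not hardcoded, the compiler cannot turn the mul into a shift. While muls as such are cheap nowadays, they also generate more register pressure afaik.

The code (validated by subtracting the result from the "naive" rotate1 implementation):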

const stepsize = 32;
procedure rotatealign(Source: tbw8image; Target:tbw8image);

var stepsx,stepsy,restx,resty : Integer;
   RowPitchSource, RowPitchTarget : Integer;
   pSource, pTarget,ps1,ps2 : pchar;
   x,y,i,j: integer;
   rpstep : integer;
begin
  RowPitchSource := source.RowPitch;          // bytes to jump to next line. Can be negative (includes alignment)
  RowPitchTarget := target.RowPitch;        rpstep:=RowPitchTarget*stepsize;
  stepsx:=source.ImageWidth div stepsize;
  stepsy:=source.ImageHeight div stepsize;
  // check if mod 16=0 here for both dimensions, if so -> SSE2.
  for y := 0 to stepsy - 1 do
    begin
      psource:=source.GetImagePointer(0,y*stepsize);    // gets pointer to pixel x,y
      ptarget:=Target.GetImagePointer(target.imagewidth-(y+1)*stepsize,0);
      for x := 0 to stepsx - 1 do
        begin
          for i := 0 to stepsize - 1 do
            begin
              ps1:=@psource[rowpitchsource*i];   // ( 0,i)
              ps2:=@ptarget[stepsize-1-i];       //  (maxx-i,0);
              for j := 0 to stepsize - 1 do
               begin
                 ps2[0]:=ps1[j];
                 inc(ps2,RowPitchTarget);
               end;
            end;
          inc(psource,stepsize);
          inc(ptarget,rpstep);
        end;
    end;
  // 3 more areas to do, with dimensions
  // - stepsy*stepsize * restx        // right most column of restx width
  // - stepsx*stepsize * resty        // bottom row with resty height
  // - restx*resty                    // bottom-right rectangle.
  restx:=source.ImageWidth mod stepsize;   // typically zero because width is 
                                          // typically 1024 or 2048
  resty:=source.Imageheight mod stepsize;
  if restx>0 then
    begin
      // one loop less, since we know this fits in one line of  "blocks"
      psource:=source.GetImagePointer(source.ImageWidth-restx,0);    // gets pointer to pixel x,y
      ptarget:=Target.GetImagePointer(Target.imagewidth-stepsize,Target.imageheight-restx);
      for y := 0 to stepsy - 1 do
        begin
          for i := 0 to stepsize - 1 do
            begin
              ps1:=@psource[rowpitchsource*i];   // ( 0,i)
              ps2:=@ptarget[stepsize-1-i];       //  (maxx-i,0);
              for j := 0 to restx - 1 do
               begin
                 ps2[0]:=ps1[j];
                 inc(ps2,RowPitchTarget);
               end;
            end;
         inc(psource,stepsize*RowPitchSource);
         dec(ptarget,stepsize);
       end;
    end;
  if resty>0 then
    begin
      // one loop less, since we know this fits in one line of  "blocks"
      psource:=source.GetImagePointer(0,source.ImageHeight-resty);    // gets pointer to pixel x,y
      ptarget:=Target.GetImagePointer(0,0);
      for x := 0 to stepsx - 1 do
        begin
          for i := 0 to resty- 1 do
            begin
              ps1:=@psource[rowpitchsource*i];   // ( 0,i)
              ps2:=@ptarget[resty-1-i];       //  (maxx-i,0);
              for j := 0 to stepsize - 1 do
               begin
                 ps2[0]:=ps1[j];
                 inc(ps2,RowPitchTarget);
               end;
            end;
         inc(psource,stepsize);
         inc(ptarget,rpstep);
       end;
    end;
 if (resty>0) and (restx>0) then
    begin
      // another loop less, since only one block
      psource:=source.GetImagePointer(source.ImageWidth-restx,source.ImageHeight-resty);    // gets pointer to pixel x,y
      ptarget:=Target.GetImagePointer(0,target.ImageHeight-restx);
      for i := 0 to resty- 1 do
        begin
          ps1:=@psource[rowpitchsource*i];   // ( 0,i)
          ps2:=@ptarget[resty-1-i];       //  (maxx-i,0);
          for j := 0 to restx - 1 do
            begin
              ps2[0]:=ps1[j];
              inc(ps2,RowPitchTarget);
            end;
       end;
    end;
end;
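
For comparison, the three remainder areas handled separately above can also be folded into the main loops by clamping every tile to the image bounds. The following is a minimal C sketch of that variant, not the routine that was timed above. The names `rotate90_tiled` and `TILE` are hypothetical, the pitches are assumed to be positive, and the rotation is clockwise, so source pixel (x, y) lands at target pixel (src_h - 1 - y, x), matching the Delphi routine.

```c
#include <stddef.h>
#include <stdint.h>

enum { TILE = 32 };  /* hypothetical tile size; 32 was the fastest step size measured above */

/* Rotate an 8bpp image 90 degrees clockwise: source pixel (x, y) lands at
   target pixel (src_h - 1 - y, x). Pitches are in bytes and assumed positive. */
void rotate90_tiled(const uint8_t *src, size_t src_w, size_t src_h, size_t src_pitch,
                    uint8_t *dst, size_t dst_pitch)
{
    for (size_t by = 0; by < src_h; by += TILE) {
        /* Clamp the tile so a partial tile at the bottom edge needs no special case. */
        size_t y_end = (by + TILE < src_h) ? by + TILE : src_h;
        for (size_t bx = 0; bx < src_w; bx += TILE) {
            /* Same clamping for a partial tile at the right edge. */
            size_t x_end = (bx + TILE < src_w) ? bx + TILE : src_w;
            for (size_t y = by; y < y_end; y++) {
                const uint8_t *s = src + y * src_pitch + bx;
                uint8_t *d = dst + bx * dst_pitch + (src_h - 1 - y);
                for (size_t x = bx; x < x_end; x++) {
                    *d = *s++;
                    d += dst_pitch;  /* next target row, same target column */
                }
            }
        }
    }
}
```

The clamping costs two comparisons per tile, which is negligible next to the 1024 pixel copies in a full tile. The separate remainder loops in the Delphi version avoid even that, at the price of more code.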

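Update 2: generics

I tried to update this code to a generics version in Delphi XE. I failed because of QC 99703, and people on the forum have confirmed that it also exists in XE2. Please vote for it :-)

Update 3: generics now work in XE10

Update 4

In 2017 I did some work on an assembler version for 8x8 blocks of 8bpp images, and asked a related SO question about shuffle bottlenecks, where Peter Cordes generously helped me out. This code still has a missed opportunity: it needs another loop tiling level to aggregate multiple 8x8 block iterations into larger pseudo-iterations such as 64x64. At the moment it processes whole lines again, which is wasteful.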

score 22 · Accepted Answer

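Yes, there are faster ways to do this.

Your simple loop spends most of its time in cache misses. This happens because you touch a lot of data in very different places in a tight loop. Even worse: your memory locations are exactly a power of two apart. That is the stride at which the cache performs worst.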
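
To make the power-of-two effect concrete, here is a small C sketch that counts how many cache sets a column walk touches. The cache geometry (32 KiB, 8-way, 64-byte lines, hence 64 sets) is an assumption chosen for illustration, and `distinct_sets` is a hypothetical helper, not part of any library.

```c
#include <stddef.h>

/* Assumed cache geometry, for illustration only. */
enum { LINE_BYTES = 64, NUM_SETS = 64 };  /* 32 KiB / (64 B * 8 ways) */

/* Count how many distinct cache sets are touched when walking `rows`
   addresses that are `stride` bytes apart (one column of the source). */
size_t distinct_sets(size_t stride, size_t rows)
{
    unsigned char seen[NUM_SETS] = {0};
    size_t count = 0;
    for (size_t r = 0; r < rows; r++) {
        size_t set = (r * stride / LINE_BYTES) % NUM_SETS;
        if (!seen[set]) {
            seen[set] = 1;
            count++;
        }
    }
    return count;
}
```

Under these assumptions `distinct_sets(2048, 2048)` is 2, so a column walk with a 2048-byte stride can keep at most 2 x 8 = 16 lines cached at once, while a stride of 2112 bytes spreads the same walk over all 64 sets.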

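You can improve this rotation algorithm if you improve the locality of your memory accesses.

A simple way to do this is to rotate each 8x8 pixel block on its own, using the same code you use for the whole bitmap, and to wrap another loop around it that splits the image rotation into chunks of 8x8 pixels each.

For example, something like this (not checked, and sorry for the C code; my Delphi skills aren't up to date):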

 // this is the outer-loop that breaks your image rotation
 // into chunks of 8x8 pixels each:
 for (int block_x = 0; block_x < 2048; block_x+=8)
 {
    for (int block_y = 0; block_y < 2048; block_y+=8)
    {
       // this is the inner-loop that processes a block
       // of 8x8 pixels.
       for (int x= 0; x<8; x++)
         for (int y=0; y<8; y++)
            dest[x+block_x][y+block_y] = src[y+block_y][x+block_x];
    }
 }

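There are other ways as well. You could process the data in Hilbert order or Morton order. In theory that would be a bit faster still, but the code would be much more complex.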
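
As a rough idea of what Morton order involves, the C sketch below visits the 8x8 blocks of a square image in Z-order by de-interleaving the bits of a running index. It performs the same transposing copy as the blocked loop above. The function names are hypothetical, and the sketch assumes that `n` is a power of two and a multiple of 8; it makes no claim about how much faster, if at all, this ordering is in practice.

```c
#include <stddef.h>
#include <stdint.h>

enum { BLOCK = 8 };

/* Keep the even-numbered bits of v and pack them into the low 16 bits. */
static uint32_t compact_even_bits(uint32_t v)
{
    v &= 0x55555555u;
    v = (v | (v >> 1)) & 0x33333333u;
    v = (v | (v >> 2)) & 0x0F0F0F0Fu;
    v = (v | (v >> 4)) & 0x00FF00FFu;
    v = (v | (v >> 8)) & 0x0000FFFFu;
    return v;
}

/* Split a Morton (Z-order) index into block coordinates. */
void morton_decode(uint32_t index, uint32_t *bx, uint32_t *by)
{
    *bx = compact_even_bits(index);       /* x uses bits 0, 2, 4, ... */
    *by = compact_even_bits(index >> 1);  /* y uses bits 1, 3, 5, ... */
}

/* Transposing copy of an n x n byte image, visiting 8x8 blocks in Morton order.
   Assumes n is a power of two and a multiple of BLOCK. */
void transpose_morton(const uint8_t *src, uint8_t *dst, size_t n)
{
    uint32_t blocks_per_side = (uint32_t)(n / BLOCK);
    for (uint32_t i = 0; i < blocks_per_side * blocks_per_side; i++) {
        uint32_t bx, by;
        morton_decode(i, &bx, &by);
        size_t x0 = (size_t)bx * BLOCK, y0 = (size_t)by * BLOCK;
        for (size_t y = 0; y < BLOCK; y++)
            for (size_t x = 0; x < BLOCK; x++)
                dst[n * (y + y0) + x + x0] = src[n * (x + x0) + y + y0];
    }
}
```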

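By the way, since you mentioned that SSE is an option for you: note that you can rotate an 8x8 byte block within the SSE registers. It is a bit tricky to get working, but looking at SSE matrix transpose code should get you started, as it is the same thing.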
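
A minimal C sketch of such an 8x8 byte transpose using SSE2 intrinsics follows. It uses the unpack (interleave) instructions rather than shifting single bytes between registers. The name `transpose8x8_u8` is hypothetical, and the scalar branch is only there so that the sketch still compiles where SSE2 is not enabled.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  /* SSE2 intrinsics */
#define USE_SSE2 1
#endif

/* Transpose one 8x8 block of bytes: dst[x][y] = src[y][x].
   Pitches are the byte distances between consecutive rows. */
void transpose8x8_u8(const uint8_t *src, ptrdiff_t src_pitch,
                     uint8_t *dst, ptrdiff_t dst_pitch)
{
#ifdef USE_SSE2
    __m128i r[8];
    for (int i = 0; i < 8; i++)  /* each load fills the low 8 bytes of a register */
        r[i] = _mm_loadl_epi64((const __m128i *)(src + i * src_pitch));

    /* Interleave the bytes of adjacent rows. */
    __m128i t0 = _mm_unpacklo_epi8(r[0], r[1]);
    __m128i t1 = _mm_unpacklo_epi8(r[2], r[3]);
    __m128i t2 = _mm_unpacklo_epi8(r[4], r[5]);
    __m128i t3 = _mm_unpacklo_epi8(r[6], r[7]);

    /* Interleave 16-bit pairs: each 32-bit lane now holds 4 rows of one column. */
    __m128i u0 = _mm_unpacklo_epi16(t0, t1);  /* columns 0-3 of rows 0-3 */
    __m128i u1 = _mm_unpackhi_epi16(t0, t1);  /* columns 4-7 of rows 0-3 */
    __m128i u2 = _mm_unpacklo_epi16(t2, t3);  /* columns 0-3 of rows 4-7 */
    __m128i u3 = _mm_unpackhi_epi16(t2, t3);  /* columns 4-7 of rows 4-7 */

    /* Interleave 32-bit lanes: each register now holds two complete output rows. */
    __m128i v[4];
    v[0] = _mm_unpacklo_epi32(u0, u2);  /* output rows 0 and 1 */
    v[1] = _mm_unpackhi_epi32(u0, u2);  /* output rows 2 and 3 */
    v[2] = _mm_unpacklo_epi32(u1, u3);  /* output rows 4 and 5 */
    v[3] = _mm_unpackhi_epi32(u1, u3);  /* output rows 6 and 7 */

    for (int i = 0; i < 4; i++) {
        _mm_storel_epi64((__m128i *)(dst + (2 * i) * dst_pitch), v[i]);
        _mm_storel_epi64((__m128i *)(dst + (2 * i + 1) * dst_pitch),
                         _mm_srli_si128(v[i], 8));
    }
#else
    /* Scalar fallback when SSE2 is not available at compile time. */
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            dst[x * dst_pitch + y] = src[y * src_pitch + x];
#endif
}
```

This is a transpose, not yet a rotation. Loading the eight source rows in reverse order (row 7 first) turns the same sequence into a 90 degree clockwise rotation of the block.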

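EDIT:

Just checked:

With a block size of 8x8 pixels the code runs about 5 times faster on my machine. With a block size of 16x16 it runs about 10 times faster.

It seems to be a good idea to experiment with different block sizes.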
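
One way to run such experiments without writing a separate function per block size is to make the block size a parameter. The sketch below is a hypothetical helper along those lines: `transpose_blocked` and `seconds_for` are made-up names, the image is assumed to be square with `n` a multiple of `block`, and timing uses the standard C `clock()` function.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Transposing copy of an n x n byte image, processed in block x block tiles.
   Assumes n is a multiple of block. */
void transpose_blocked(const uint8_t *src, uint8_t *dst, size_t n, size_t block)
{
    for (size_t by = 0; by < n; by += block)
        for (size_t bx = 0; bx < n; bx += block)
            for (size_t y = 0; y < block; y++)
                for (size_t x = 0; x < block; x++)
                    dst[n * (y + by) + x + bx] = src[n * (x + bx) + y + by];
}

/* CPU time in seconds for `repeats` runs with the given block size. */
double seconds_for(const uint8_t *src, uint8_t *dst, size_t n, size_t block, int repeats)
{
    clock_t start = clock();
    for (int i = 0; i < repeats; i++)
        transpose_blocked(src, dst, n, block);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling `seconds_for` with block sizes such as 1, 8, 16, 32 and 64 on the same buffers gives directly comparable numbers. A block size of 1 reproduces the naive loop and serves as the baseline.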

Here is the (very simple) test program I used:

#include <stdio.h>
#include <windows.h>

char temp1[2048*2048];
char temp2[2048*2048];

void rotate1 (void)
{
  int x,y;
  for (y=0; y<2048; y++)
  for (x=0; x<2048; x++)
    temp2[2048*y+x] = temp1[2048*x+y];
}

void rotate2 (void)
{
  int x,y;
  int bx, by;

  for (by=0; by<2048; by+=8)
  for (bx=0; bx<2048; bx+=8)
  for (y=0; y<8; y++)
  for (x=0; x<8; x++)
    temp2[2048*(y+by)+x+bx] = temp1[2048*(x+bx)+y+by];
}

void rotate3 (void)
{
  int x,y;
  int bx, by;

  for (by=0; by<2048; by+=16)
  for (bx=0; bx<2048; bx+=16)
  for (y=0; y<16; y++)
  for (x=0; x<16; x++)
    temp2[2048*(y+by)+x+bx] = temp1[2048*(x+bx)+y+by];
}


int main (int argc, char **args)
{
  int i, t1;

  t1 = GetTickCount();
  for (i=0; i<20; i++) rotate1();
  printf ("%d\n", GetTickCount()-t1);

  t1 = GetTickCount();
  for (i=0; i<20; i++) rotate2();
  printf ("%d\n", GetTickCount()-t1);

  t1 = GetTickCount();
  for (i=0; i<20; i++) rotate3();
  printf ("%d\n", GetTickCount()-t1);

}

score 3 · Answer

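If you can use C++, you may want to look at Eigen.

It is a C++ template library that uses the SSE (2 and later) and AltiVec instruction sets, with graceful fallback to non-vectorized code.

Fast. (See benchmark.)
Expression templates allow temporaries to be removed intelligently and enable lazy evaluation where appropriate; Eigen takes care of this automatically and also handles aliasing in most cases.
Explicit vectorization is performed for the SSE (2 and later) and AltiVec instruction sets, with graceful fallback to non-vectorized code. Expression templates allow these optimizations to be performed globally for whole expressions.
With fixed-size objects, dynamic memory allocation is avoided, and loops are unrolled when that makes sense.
For large matrices, special attention is paid to cache-friendliness.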

score 0 · Answer

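You might be able to improve it by copying in cache-aligned blocks rather than by rows, since at the moment every stride through either src or dest will be a cache miss (which one depends on whether Delphi is row-major or column-major).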

score 0 · Answer

If the image isn't square, you can't do it in-place. Even if you work with square images, the transform isn't conducive to in-place work.
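
For the square case, an in-place rotation is at least possible by cycling four pixels at a time. The C sketch below shows that idea only. `rotate90_inplace_square` is a hypothetical name, and because each step touches four widely separated locations, there is no reason to expect it to be cache-friendly.

```c
#include <stddef.h>
#include <stdint.h>

/* Rotate a square n x n 8bpp image 90 degrees clockwise in place.
   Each step moves four pixels around one rotation cycle. */
void rotate90_inplace_square(uint8_t *img, size_t n, size_t pitch)
{
    for (size_t y = 0; y < n / 2; y++) {           /* concentric rings, outside in */
        for (size_t x = y; x < n - 1 - y; x++) {
            uint8_t *top    = img + y * pitch + x;
            uint8_t *right  = img + x * pitch + (n - 1 - y);
            uint8_t *bottom = img + (n - 1 - y) * pitch + (n - 1 - x);
            uint8_t *left   = img + (n - 1 - x) * pitch + y;

            uint8_t tmp = *top;
            *top    = *left;    /* left edge moves to the top    */
            *left   = *bottom;  /* bottom edge moves to the left */
            *bottom = *right;   /* right edge moves to the bottom */
            *right  = tmp;      /* top edge moves to the right   */
        }
    }
}
```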

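If you want to try to make things a little faster, you can try to take advantage of the row strides, but I think the best you can do is to read 4 bytes at a time from the source and then write them into four consecutive rows of the dest. That should cut some of your overhead, but I wouldn't expect more than a 5% improvement.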
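
A minimal C sketch of that suggestion follows: one 4-byte read from a source row, followed by four single-byte writes to consecutive target rows. The function names are hypothetical, and the byte extraction assumes a little-endian CPU, as on x86. The rotation is clockwise.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy one source row into one target column, reading 4 source bytes per step.
   Assumes a little-endian CPU, so the lowest byte of the word is the leftmost pixel. */
static void row_to_column_by4(const uint8_t *src_row, size_t width,
                              uint8_t *dst_col, size_t dst_pitch)
{
    size_t x = 0;
    for (; x + 4 <= width; x += 4) {
        uint32_t word;
        memcpy(&word, src_row + x, sizeof word);  /* alignment-safe 4-byte read */
        dst_col[0 * dst_pitch] = (uint8_t)(word);
        dst_col[1 * dst_pitch] = (uint8_t)(word >> 8);
        dst_col[2 * dst_pitch] = (uint8_t)(word >> 16);
        dst_col[3 * dst_pitch] = (uint8_t)(word >> 24);
        dst_col += 4 * dst_pitch;
    }
    for (; x < width; x++) {  /* leftover pixels when width is not a multiple of 4 */
        *dst_col = src_row[x];
        dst_col += dst_pitch;
    }
}

/* Rotate 90 degrees clockwise: source row y becomes target column src_h - 1 - y. */
void rotate90_by4(const uint8_t *src, size_t src_w, size_t src_h, size_t src_pitch,
                  uint8_t *dst, size_t dst_pitch)
{
    for (size_t y = 0; y < src_h; y++)
        row_to_column_by4(src + y * src_pitch, src_w, dst + (src_h - 1 - y), dst_pitch);
}
```

This reduces the number of source reads but leaves the strided writes untouched, which is consistent with the modest gain predicted above.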
