performance - Why is Bresenham's line algorithm more efficient than the naive algorithm

Question

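In my graphics course we were taught the naive line rasterization algorithm and then Bresenham's line drawing algorithm. We were told that computers are integer machines, which is why we should use the latter.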

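If we assume no software-level optimization, is this true for modern CPUs with MMX and other instruction sets? I looked at Intel's 64-ia-32-architectures-optimization-manual.pdf, and the latency of float addition, subtraction, and multiplication is the same as or better than that of int with MMX.
Should this matter if the algorithm is executed on the GPU? I checked the NVIDIA CUDA Programming Guide 1.0 (pdf), page 41, and the clock cycles for int and float are the same.
What is the inefficiency of casting float to int? Is the load-hit-store stall a real issue for us?
How efficient are functions that round numbers up or down? (We can think of the implementation in the C++ STL.)
Is the efficiency of Bresenham's algorithm due to using addition instead of multiplication in the inner loop?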
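
For reference, the naive approach mentioned above might look like the following minimal sketch. It assumes a line with x0 <= x1 and a slope between -1 and 1; the names `Point` and `naive_line` are made up for illustration and are not from the course material. Each pixel costs one float multiply, one float add, and a float-to-int rounding step.

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>
#include <utility>
#include <vector>

using Point = std::pair<int, int>;  // (x, y), hypothetical helper type

// Naive rasterization: evaluate y = m*x + b in floating point for every x
// and round the result to the nearest pixel row.
// Assumption: x0 <= x1 and |y1 - y0| <= x1 - x0 (gentle slope).
std::vector<Point> naive_line(int x0, int y0, int x1, int y1) {
    assert(x0 <= x1 && std::abs(y1 - y0) <= x1 - x0);
    std::vector<Point> pixels;
    if (x0 == x1) {  // degenerate case: a single pixel
        pixels.push_back({x0, y0});
        return pixels;
    }
    const float m = static_cast<float>(y1 - y0) / static_cast<float>(x1 - x0);
    const float b = static_cast<float>(y0) - m * static_cast<float>(x0);
    for (int x = x0; x <= x1; ++x) {
        const float y = m * static_cast<float>(x) + b;            // multiply + add
        pixels.push_back({x, static_cast<int>(std::lround(y))});  // float -> int
    }
    return pixels;
}
```

The rounding call in the loop is where the float-to-int conversion asked about above would happen once per pixel.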

score 2 · Accepted Answer

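Calling computers integer machines is a bit misleading, but the sentiment is mostly true. As far as I know, CPUs use integer registers to generate the memory addresses they read from and write to. Keeping line drawing in integer registers means you avoid the overhead of copying from other registers into integer registers to generate the memory address for writing each pixel during line drawing.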
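
To make that concrete, here is a rough sketch of where the integer address comes in. The `Framebuffer` type and the function names are invented for this example, and the layout (row-major, 32-bit pixels) is an assumption. With integer coordinates the offset is pure integer arithmetic; with a floating-point `y` there has to be a conversion before the offset can be formed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical framebuffer, row-major, one 32-bit value per pixel.
struct Framebuffer {
    int width;
    int height;
    std::vector<std::uint32_t> pixels;
};

Framebuffer make_framebuffer(int width, int height) {
    assert(width > 0 && height > 0);
    const std::size_t count =
        static_cast<std::size_t>(width) * static_cast<std::size_t>(height);
    return Framebuffer{width, height, std::vector<std::uint32_t>(count, 0u)};
}

// Integer coordinates: the offset y * width + x stays in integer arithmetic.
void put_pixel_int(Framebuffer& fb, int x, int y, std::uint32_t color) {
    const std::size_t offset =
        static_cast<std::size_t>(y) * static_cast<std::size_t>(fb.width) +
        static_cast<std::size_t>(x);
    fb.pixels[offset] = color;
}

// Floating-point y: the value has to be converted to an integer first,
// which is the register-set transfer discussed in this answer.
// Assumption: y >= 0, so adding 0.5 and truncating rounds to nearest.
void put_pixel_float_y(Framebuffer& fb, int x, float y, std::uint32_t color) {
    const int yi = static_cast<int>(y + 0.5f);
    put_pixel_int(fb, x, yi, color);
}
```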

As for your specific questions:

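Since you need to use the general-purpose registers to access memory, using SSE or the FPU to calculate memory offsets (pointers) still incurs the overhead of transferring data from those registers to the general-purpose ones. So it depends on whether the overhead of transferring from one register set to another outweighs the performance gained from using a particular instruction set.
GPUs tend to have a unified register set, so it shouldn't matter nearly as much.
Casting a float to an int is not expensive in itself. The overhead comes from transferring data from one register set to another. Usually that has to be done through memory, and if your CPU has load-hit-store penalties, this transfer is a big source of them.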
The performance of rounding up or down depends on the CPU and the compiler. On the slow end, MSVC used to use a function to round to zero which mucked with the FPU control word. On the fast end you have special CPU instructions that handle rounding directly.
Bresenham's line drawing algorithm is fast because it reduces determining where to draw points on a line from the naive y = m*x + b formula to an addition plus a branch (and the branch can be eliminated through well-known branchless integer techniques). The run-slice version of Bresenham's line drawing algorithm can be even faster, as it determines "runs" of pixels with the same component directly rather than iterating pixel by pixel.
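
As a rough illustration of the last point, here is a minimal sketch of a commonly used all-octant, integer-only formulation of Bresenham's algorithm. It is not necessarily the exact variant taught in the course, and `Point` and `bresenham_line` are names chosen for this example. The inner loop uses only integer additions, one doubling, and comparisons; there is no multiply by a slope and no rounding.

```cpp
#include <cstdlib>
#include <utility>
#include <vector>

using Point = std::pair<int, int>;  // (x, y), hypothetical helper type

// Integer-only Bresenham line, valid for any direction.
// err tracks the accumulated error term; each step adds dx and/or dy to it.
std::vector<Point> bresenham_line(int x0, int y0, int x1, int y1) {
    const int dx = std::abs(x1 - x0);
    const int dy = -std::abs(y1 - y0);
    const int sx = x0 < x1 ? 1 : -1;  // step direction in x
    const int sy = y0 < y1 ? 1 : -1;  // step direction in y
    int err = dx + dy;
    std::vector<Point> pixels;
    while (true) {
        pixels.push_back({x0, y0});
        if (x0 == x1 && y0 == y1) break;  // reached the end point
        const int e2 = 2 * err;
        if (e2 >= dy) { err += dy; x0 += sx; }  // step in x
        if (e2 <= dx) { err += dx; y0 += sy; }  // step in y
    }
    return pixels;
}
```

For example, `bresenham_line(0, 0, 4, 2)` visits (0,0), (1,1), (2,1), (3,2), (4,2). Because x and y stay in integers throughout, they can be used to form a pixel address directly.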
