c - How does Duff's device work?

Question

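I read the Wikipedia article on Duff's device, and I don't get it. I'm really interested, but I've read the explanation there a couple of times and I still don't understand how Duff's device works.

What would a more detailed explanation be?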

score 258 · Accepted Answer

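There are some good explanations elsewhere, but let me give it a try. (This is a lot easier on a whiteboard!) Here is the Wikipedia example with some annotations.

Let's say you're copying 20 bytes. The flow control of the program for the first pass is: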

int count;                        // Set to 20
{
    int n = (count + 7) / 8;      // n is now 3.  (The "while" is going
                                  //              to be run three times.)

    switch (count % 8) {          // The remainder is 4 (20 modulo 8) so
                                  // jump to the case 4

    case 0:                       // [skipped]
             do {                 // [skipped]
                 *to = *from++;   // [skipped]
    case 7:      *to = *from++;   // [skipped]
    case 6:      *to = *from++;   // [skipped]
    case 5:      *to = *from++;   // [skipped]
    case 4:      *to = *from++;   // Start here.  Copy 1 byte  (total 1)
    case 3:      *to = *from++;   // Copy 1 byte (total 2)
    case 2:      *to = *from++;   // Copy 1 byte (total 3)
    case 1:      *to = *from++;   // Copy 1 byte (total 4)
           } while (--n > 0);     // N = 3 Reduce N by 1, then jump up
                                  //       to the "do" if it's still
    }                             //        greater than 0 (and it is)
}

Now, start the second pass; only the annotated code runs:

int count;                        //
{
    int n = (count + 7) / 8;      //
                                  //

    switch (count % 8) {          //
                                  //

    case 0:                       //
             do {                 // The while jumps to here.
                 *to = *from++;   // Copy 1 byte (total 5)
    case 7:      *to = *from++;   // Copy 1 byte (total 6)
    case 6:      *to = *from++;   // Copy 1 byte (total 7)
    case 5:      *to = *from++;   // Copy 1 byte (total 8)
    case 4:      *to = *from++;   // Copy 1 byte (total 9)
    case 3:      *to = *from++;   // Copy 1 byte (total 10)
    case 2:      *to = *from++;   // Copy 1 byte (total 11)
    case 1:      *to = *from++;   // Copy 1 byte (total 12)
           } while (--n > 0);     // N = 2 Reduce N by 1, then jump up
                                  //       to the "do" if it's still
    }                             //       greater than 0 (and it is)
}

Now, start the third pass:

int count;                        //
{
    int n = (count + 7) / 8;      //
                                  //

    switch (count % 8) {          //
                                  //

    case 0:                       //
             do {                 // The while jumps to here.
                 *to = *from++;   // Copy 1 byte (total 13)
    case 7:      *to = *from++;   // Copy 1 byte (total 14)
    case 6:      *to = *from++;   // Copy 1 byte (total 15)
    case 5:      *to = *from++;   // Copy 1 byte (total 16)
    case 4:      *to = *from++;   // Copy 1 byte (total 17)
    case 3:      *to = *from++;   // Copy 1 byte (total 18)
    case 2:      *to = *from++;   // Copy 1 byte (total 19)
    case 1:      *to = *from++;   // Copy 1 byte (total 20)
           } while (--n > 0);     // N = 1  Reduce N by 1, then jump up
                                  //       to the "do" if it's still
    }                             //       greater than 0 (and it's not, so bail)
}                                 // continue here...

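20 bytes are now copied.

Note: The original Duff's device (shown above) copied to an I/O device at the to address, so there was no need to increment the to pointer. When copying between two memory buffers you would need to use *to++.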
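To make the note above concrete, here is a minimal sketch of the memory-to-memory variant, in which both pointers advance. The function name duff_copy and the assert on count are illustrative assumptions, not part of Duff's original code; like the original, the sketch assumes count > 0.

```c
#include <assert.h>
#include <stddef.h>

/* Memory-to-memory variant of Duff's device: both pointers advance.
   Hypothetical helper; like the original, it requires count > 0. */
void duff_copy(char *to, const char *from, size_t count)
{
    assert(count > 0);
    size_t n = (count + 7) / 8;   /* number of passes through the do-while */

    switch (count % 8) {          /* first pass enters at the remainder */
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}
```

Calling duff_copy(dst, src, 20) follows the three passes traced above (4, then 8, then 8 bytes), except that the destination advances too.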

score 115 · Accepted Answer

The explanation in Dr. Dobb's Journal is the best one I have found on the topic.

This was my AHA moment:

for (i = 0; i < len; ++i) {
    HAL_IO_PORT = *pSource++;
}

becomes:

int n = len / 8;
for (i = 0; i < n; ++i) {
    HAL_IO_PORT = *pSource++;
    HAL_IO_PORT = *pSource++;
    HAL_IO_PORT = *pSource++;
    HAL_IO_PORT = *pSource++;
    HAL_IO_PORT = *pSource++;
    HAL_IO_PORT = *pSource++;
    HAL_IO_PORT = *pSource++;
    HAL_IO_PORT = *pSource++;
}

n = len % 8;
for (i = 0; i < n; ++i) {
    HAL_IO_PORT = *pSource++;
}

becomes:

int n = (len + 8 - 1) / 8;
switch (len % 8) {
    case 0: do { HAL_IO_PORT = *pSource++;
    case 7: HAL_IO_PORT = *pSource++;
    case 6: HAL_IO_PORT = *pSource++;
    case 5: HAL_IO_PORT = *pSource++;
    case 4: HAL_IO_PORT = *pSource++;
    case 3: HAL_IO_PORT = *pSource++;
    case 2: HAL_IO_PORT = *pSource++;
    case 1: HAL_IO_PORT = *pSource++;
               } while (--n > 0);
}

score 84 · Accepted Answer

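There are two key things to Duff's device. First, which I suspect is the easier part to understand, the loop is unrolled. This trades larger code size for more speed by avoiding some of the overhead involved in checking whether the loop is finished and jumping back to the top of the loop. The CPU can run faster when it is executing straight-line code instead of jumping.

The second aspect is the switch statement. It allows the code to jump into the middle of the loop the first time through. The surprising part to most people is that such a thing is allowed. Well, it is allowed. Execution starts at the calculated case label, and then it falls through to each successive assignment statement, just like any other switch statement. After the last case label, execution reaches the bottom of the loop, at which point it jumps back to the top. The top of the loop is inside the switch statement, so the switch is not re-evaluated any more.

The original loop is unrolled eight times, so the number of iterations is divided by eight. If the number of bytes to copy isn't a multiple of eight, there are some bytes left over. Most algorithms that copy blocks of bytes at a time handle the remainder bytes at the end, but Duff's device handles them at the beginning. The function calculates count % 8 for the switch statement to figure out what the remainder will be, jumps to the case label for that many bytes, and copies them. Then the loop continues to copy groups of eight bytes.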
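One way to see that the remainder really is handled first is to record how many bytes each pass through the loop covers. The sketch below replaces each copy with a counter increment; the function name duff_pass_sizes and its output array are hypothetical, introduced only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Records how many bytes each pass through the do-while would copy.
   per_pass must have room for (count + 7) / 8 entries.
   Returns the number of passes. Requires count > 0. */
size_t duff_pass_sizes(size_t count, size_t *per_pass)
{
    assert(count > 0);
    size_t n = (count + 7) / 8;
    size_t pass = 0;
    size_t copied = 0;            /* bytes handled in the current pass */

    switch (count % 8) {
    case 0: do { copied++;
    case 7:      copied++;
    case 6:      copied++;
    case 5:      copied++;
    case 4:      copied++;
    case 3:      copied++;
    case 2:      copied++;
    case 1:      copied++;
                 per_pass[pass++] = copied;   /* close out this pass */
                 copied = 0;
            } while (--n > 0);
    }
    return pass;
}
```

For count = 20 this fills the array with 4, 8, 8: the short pass comes first, followed by two full groups of eight.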

score 14 · Accepted Answer

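The purpose of Duff's device is to reduce the number of comparisons done in a tight memcpy implementation.

Suppose you want to copy 'count' bytes from b to a; the straightforward approach is: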

  do {                      
      *a = *b++;            
  } while (--count > 0);

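How many times do you need to compare count to see if it is above 0? 'count' times.

Now, Duff's device exploits a nasty, unintended side effect of switch/case that lets you reduce the number of comparisons to about count / 8.

Now suppose you want to copy 20 bytes using Duff's device; how many comparisons would you need? Only 3, since you copy eight bytes at a time, except on the first pass, where you copy just 4.

Update: You don't have to use 8 comparisons/case labels in the switch, but it is a reasonable trade-off between function size and speed.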
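The comparison counts claimed above can be checked with a small sketch that tallies bytes handled and loop-condition tests for both loop shapes. The struct tally and both function names are hypothetical; the copy itself is replaced by a counter, and the one-time switch dispatch is not counted as a loop test.

```c
#include <assert.h>

struct tally { int bytes; int tests; };

/* Simple do-while: one loop test per byte. Requires count > 0. */
struct tally simple_tally(int count)
{
    assert(count > 0);
    struct tally t = {0, 0};
    do {
        t.bytes++;                /* stands in for *a = *b++ */
        t.tests++;                /* the --count > 0 test below */
    } while (--count > 0);
    return t;
}

/* Duff's device: one loop test per pass of up to 8 bytes. */
struct tally duff_tally(int count)
{
    assert(count > 0);
    struct tally t = {0, 0};
    int n = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { t.bytes++;
    case 7:      t.bytes++;
    case 6:      t.bytes++;
    case 5:      t.bytes++;
    case 4:      t.bytes++;
    case 3:      t.bytes++;
    case 2:      t.bytes++;
    case 1:      t.bytes++;
                 t.tests++;       /* the --n > 0 test below */
            } while (--n > 0);
    }
    return t;
}
```

With count = 20, simple_tally reports 20 tests and duff_tally reports 3, while both account for all 20 bytes.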

score 8 · Accepted Answer

When I first read it, I auto-formatted it to this:

void dsend(char* to, char* from, int count) {
    int n = (count + 7) / 8;
    switch (count % 8) {
        case 0: do {
                *to = *from++;
                case 7: *to = *from++;
                case 6: *to = *from++;
                case 5: *to = *from++;
                case 4: *to = *from++;
                case 3: *to = *from++;
                case 2: *to = *from++;
                case 1: *to = *from++;
            } while (--n > 0);
    }
}

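I had no idea what was happening.

Perhaps not when this question was asked, but Wikipedia now has a very good explanation:

The device is valid, legal C by virtue of two attributes in C:

Relaxed specification of the switch statement in the language's definition. At the time of the device's invention this was the first edition of The C Programming Language, which requires only that the controlled statement of the switch be a syntactically valid (compound) statement within which case labels can appear prefixing any sub-statement. In conjunction with the fact that, in the absence of a break statement, the flow of control will fall through from a statement controlled by one case label to that controlled by the next, this means that the code specifies a succession of count copies from sequential source addresses to the memory-mapped output port.

The ability to legally jump into the middle of a loop in C.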

score 6 · Accepted Answer

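1: Duff's device is a particular implementation of loop unrolling. Loop unrolling is an optimization technique applicable when you have an operation to perform N times in a loop: you can trade program size for speed by executing the loop N/n times and inlining (unrolling) the loop code n times inside the loop, e.g. replacing: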

for (int i=0; i<N; i++) {
    // [The loop code...] 
}

with

for (int i=0; i<N/n; i++) {
    // [The loop code...]
    // [The loop code...]
    // [The loop code...]
    ...
    // [The loop code...] // n times!
}

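This works great if N % n == 0 - no need for Duff! If that is not true, then you have to handle the remainder, which is a pain.

2: How does Duff's device differ from this standard loop unrolling?
Duff's device is just a clever way of dealing with the remainder loop cycles when N % n != 0. The whole do / while executes N / n times as per standard loop unrolling (because case 0 applies). On the first run through the loop the case kicks in and we run the loop code the 'remainder' number of times - the remaining runs through the loop run 'normally'.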
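For comparison, the conventional way to handle the remainder is a separate cleanup loop after the unrolled one. The following is a minimal sketch with an unroll factor of 4 (n = 4 in the notation above); the function sum_unrolled is a hypothetical example, not something from the answer.

```c
#include <stddef.h>

/* Sums an array with the loop body unrolled 4 times, followed by a
   cleanup loop for the len % 4 leftover elements. */
long sum_unrolled(const int *a, size_t len)
{
    long total = 0;
    size_t i = 0;

    /* Main loop: runs len / 4 times, four elements per iteration. */
    for (; i + 4 <= len; i += 4) {
        total += a[i];
        total += a[i + 1];
        total += a[i + 2];
        total += a[i + 3];
    }

    /* Cleanup loop: handles the 0 to 3 elements that remain. */
    for (; i < len; i++)
        total += a[i];

    return total;
}
```

Duff's device folds that second loop into the first by entering the unrolled body partway through on the first pass.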

score 3 · Accepted Answer

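While I'm not 100% sure what you're asking for, here goes...

The issue that Duff's device addresses is one of loop unrolling (as you'll no doubt have seen on the Wiki link you posted). What this basically equates to is an optimization of run-time efficiency over memory footprint. Duff's device deals with serial copying, rather than just any old problem, but it is a classic example of how optimizations can be made by reducing the number of times a comparison needs to be done in a loop.

As an alternate example, which may be easier to understand, imagine you have an array of items you wish to loop over, adding 1 to each of them... Ordinarily, you might use a for loop and loop around 100 times. This seems fairly logical, and it is... however, an optimization can be made by unrolling the loop (obviously not too far... or you may as well just not use the loop).

So a regular for loop: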

for(int i = 0; i < 100; i++)
{
    myArray[i] += 1;
}

becomes

for(int i = 0; i < 100; i += 10)
{
    myArray[i] += 1;
    myArray[i+1] += 1;
    myArray[i+2] += 1;
    myArray[i+3] += 1;
    myArray[i+4] += 1;
    myArray[i+5] += 1;
    myArray[i+6] += 1;
    myArray[i+7] += 1;
    myArray[i+8] += 1;
    myArray[i+9] += 1;
}

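What Duff's device does is implement this idea in C, but (as you saw on the Wiki) with serial copies. What you see above, in the unrolled example, is 10 comparisons compared to 100 in the original - this amounts to a minor, but possibly significant, optimization.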

score 2 · Accepted Answer

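Here's a non-detailed explanation, which is what I feel to be the crux of Duff's device:

The thing is, C is basically a nice facade for assembly language (PDP-7 assembly to be specific; if you studied that you would see how striking the similarities are). And, in assembly language, you don't really have loops - you have labels and conditional-branch instructions. So the loop is just a part of the overall sequence of instructions, with a label and a branch somewhere: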

        instruction
label1: instruction
        instruction
        instruction
        instruction
        jump to label1 if some condition

And the switch instruction is branching/jumping ahead somewhat:

        evaluate expression into register r
        compare r with first case value
        branch to first case label if equal
        compare r with second case value
        branch to second case label if equal
        etc....
first_case_label: 
        instruction
        instruction
second_case_label: 
        instruction
        instruction
        etc...

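In assembly it's easy to imagine how to combine these two control structures, and when you think of it that way, their combination in C doesn't seem so weird anymore.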
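The assembly-style view can be spelled out in C itself by writing the device with nothing but labels, comparisons, and goto. This is a sketch for illustration only: the function name duff_copy_goto and the label names are assumptions, and it copies memory to memory, so both pointers advance.

```c
#include <assert.h>
#include <stddef.h>

/* Duff's device rewritten with explicit labels and branches.
   Hypothetical helper; requires count > 0. */
void duff_copy_goto(char *to, const char *from, size_t count)
{
    assert(count > 0);
    size_t n = (count + 7) / 8;
    size_t r = count % 8;

    /* The "switch": compare and branch forward to an entry point. */
    if (r == 7) goto entry7;
    if (r == 6) goto entry6;
    if (r == 5) goto entry5;
    if (r == 4) goto entry4;
    if (r == 3) goto entry3;
    if (r == 2) goto entry2;
    if (r == 1) goto entry1;
    /* r == 0 falls through to the top of the loop body. */

top:    *to++ = *from++;
entry7: *to++ = *from++;
entry6: *to++ = *from++;
entry5: *to++ = *from++;
entry4: *to++ = *from++;
entry3: *to++ = *from++;
entry2: *to++ = *from++;
entry1: *to++ = *from++;
    /* The "loop": a conditional branch back to the top label. */
    if (--n > 0) goto top;
}
```

Seen this way, the switch is only a forward branch into a straight run of instructions, and the do-while is only a backward branch at the end of that run.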

score 1 · Accepted Answer

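This is an answer I posted to another question about Duff's device that got some upvotes before that question was closed as a duplicate. I think it provides a bit of valuable context here on why you should avoid this construct.

"This is Duff's device. It's a method of unrolling loops that avoids having to add a secondary fix-up loop to deal with the times when the number of loop iterations is not known to be an exact multiple of the unrolling factor.

Since most answers here seem to be generally positive about it, I'm going to highlight the downsides.

With this code a compiler is going to struggle to apply any optimization to the loop body. If you just wrote the code as a simple loop, a modern compiler should be able to handle the unrolling for you. This way you maintain readability and performance and have some hope of other optimizations being applied to the loop body.

The Wikipedia article referenced by others even says that when this 'pattern' was removed from the XFree86 source code, performance actually improved.

This outcome is typical of blindly hand-optimizing any code you happen to think might need it. It prevents the compiler from doing its job properly, makes your code less readable and more prone to bugs, and typically slows it down. If you were doing things the right way in the first place, i.e. writing simple code, then profiling for bottlenecks, then optimizing, you would never even think to use something like this. Not with a modern CPU and compiler anyway.

It's fine to understand it, but I'd be surprised if you ever actually use it."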

score 0 · Accepted Answer

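Just experimenting, I found another variant that gets by without interleaving the switch statement and the do-while loop:

int n = count / 8;   /* full groups of 8; (count + 1) / 8 runs one group too many when count % 8 == 7 */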
switch (count % 8)
{
    LOOP:
case 0:
    if(n-- == 0)
        break;
    putchar('.');
case 7:
    putchar('.');
case 6:
    putchar('.');
case 5:
    putchar('.');
case 4:
    putchar('.');
case 3:
    putchar('.');
case 2:
    putchar('.');
case 1:
    putchar('.');
default:
    goto LOOP;
}

Technically, the goto still implements a loop, but this variant might be slightly more readable.

score 0 · Accepted Answer

Here is a working example of a 64-bit memcpy using Duff's device:

#include <cstddef>   // size_t
#include <cstring>   // memset, memcmp
#include <iostream>
#include <memory>

inline void __memcpy(void* to, const void* from, size_t count)
{
    size_t numIter = (count  + 56) / 64;  // gives the number of iterations;  bit shift actually, not division
    size_t rest = count & 63; // % 64
    size_t rest7 = rest&7;
    rest -= rest7;

    // Duff's device with zero case handled:
    switch (rest) 
    {
        case 0:  if (count < 8)
                     break;
                 do { *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
        case 56:      *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
        case 48:      *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
        case 40:      *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
        case 32:      *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
        case 24:      *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
        case 16:      *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
        case 8:      *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
                } while (--numIter > 0);
    }

    switch (rest7)
    {
        case 7: *(((unsigned char*)to)+6) = *(((unsigned char*)from)+6);
        case 6: *(((unsigned short*)to)+2) = *(((unsigned short*)from)+2); goto case4;
        case 5: *(((unsigned char*)to)+4) = *(((unsigned char*)from)+4);
        case 4: case4: *((unsigned int*)to) = *((unsigned int*)from); break; // 4-byte copy; unsigned long is 8 bytes on LP64 targets
        case 3: *(((unsigned char*)to)+2) = *(((unsigned char*)from)+2);
        case 2: *((unsigned short*)to) = *((unsigned short*)from); break;
        case 1: *((unsigned char*)to) = *((unsigned char*)from);
    }
}

int main()
{
    static const size_t NUM = 1024;

    std::unique_ptr<char[]> str1(new char[NUM+1]);  
    std::unique_ptr<char[]> str2(new char[NUM+1]);

    // Fill the source with a repeating a-z, A-Z, 0-9 pattern.
    for (size_t i = 0 ; i < NUM ; ++ i)
    {
        size_t idx = (i % 62);
        if (idx < 26)
            str1[i] = 'a' + idx;
        else
            if (idx < 52)
                str1[i] = 'A' + idx - 26;
            else
                str1[i] = '0' + idx - 52;
    }

    // Copy every length from 0 to NUM-1 and check for mismatches and overruns.
    for (size_t i = 0 ; i < NUM ; ++ i)
    {
        memset(str2.get(), ' ', NUM); 
        __memcpy(str2.get(), str1.get(), i);
        if (memcmp(str1.get(), str2.get(), i) || str2[i] != ' ')
        {
            std::cout << "Test failed for i=" << i << '\n';
        }

    }

    return 0;
}

It handles the zero-length case (the original Duff's device assumes num>0). The main() function contains simple test cases for __memcpy.
