c - 如何优化这个 Langton's ant sim？

Question

我正在编写 Langton 的 ant sim（用于规则字符串 RLR）并试图优化它以提高速度。这是相关的代码：

#define AREA_X 65536
#define AREA_Y 65536
#define TURN_LEFT 3
#define TURN_RIGHT 1
int main()
{
  uint_fast8_t* state;
  uint_fast64_t ant=((AREA_Y/2)*AREA_X) + (AREA_X/2);
  uint_fast8_t ant_orientation=0;
  uint_fast8_t two_pow_five=32;
  uint32_t two_pow_thirty_two=0;/*not fast, relying on exact width for overflow*/
  uint_fast8_t change_orientation[4]={0, TURN_RIGHT, TURN_LEFT, TURN_RIGHT};
  int_fast64_t* move_ant={AREA_X, 1, -AREA_X, -1};
  ... initialise empty state
  while(1)
  {
    while(two_pow_five--)/*removing this by doing 32 steps per inner loop, ~16% longer*/
    {
      while(--two_pow_thirty_two)
      {
        /*one iteration*/
        /* 54 seconds for init + 2^32 steps
        ant_orientation = ( ant_orientation + (117>>((++state[ant])*2 )) )&3;
        state[ant] = (36 >> (state[ant] *2) ) & 3;
        ant+=move_ant[ant_orientation];
        */

        /* 47 seconds for init + 2^32 steps
        ant_orientation = ( ant_orientation + ((state[ant])==1?3:1) )&3;
        state[ant] += (state[ant]==2)?-2:1;
        ant+=move_ant[ant_orientation];
        */

        /* 46 seconds for init + 2^32 steps
        ant_orientation = ( ant_orientation + ((state[ant])==1?3:1) )&3;
        if(state[ant]==2)
        {
          --state[ant];
          --state[ant];
        }
        else
          ++state[ant];
        ant+=move_ant[ant_orientation];
        */

        /* 44 seconds for init + 2^32 steps
        ant_orientation = ( ant_orientation + ((++state[ant])==2?3:1) )&3;
        if(state[ant]==3)state[ant]=0;
        ant+=move_ant[ant_orientation];
        */

        // 37 seconds for init + 2^32 steps
        // handle every situation with nested switches and constants
        switch(ant_orientation)
        {
          case 0:
            switch(state[ant])
            {
              case 0:
                ant_orientation=1;
                state[ant]=1;
                ++ant;
                break;
              case 1:
                ant_orientation=3;
                state[ant]=2;
                --ant;
                break;
              case 2:
                ant_orientation=1;
                state[ant]=0;
                ++ant;
                break;
            }
            break;
          case 1:
            switch(state[ant])
            {
              ...
            }
            break;
          case 2:
            switch(state[ant])
            {
              ...
            }
            break;
          case 3:
            switch(state[ant])
            {
              ...
            }
            break;
        }

      }
    }
    two_pow_five=32;
    ... dump to file every 2^37 steps
  }
  return 0;
}

我有两个问题：

我已经尝试通过反复试验尽可能地优化 c，有没有我没有利用的技巧？请尝试用 c 语言而不是汇编语言交谈，尽管我可能会在某个时候尝试汇编语言。
有没有更好的方法来模拟问题以提高速度？

更多信息：可移植性并不重要。我在 64 位 linux 上，使用 gcc、i5-2500k 和 16 GB 的内存。现有的状态数组使用 4GiB，程序可以使用 12GiB 的 ram。大小（uint_fast8_t）=1。边界检查是故意不存在的，很容易从转储中手动发现损坏。

编辑：也许与直觉相反，堆积 switch 语句而不是消除它们已经产生了迄今为止最好的效率。

编辑：我已经对问题进行了重新建模，并提出了比每次迭代一步更快的方法。以前，每个状态元素使用两个位并描述 Langton 蚂蚁网格中的单个单元格。新方法使用所有 8 位，并描述网格的 2x2 部分。通过查找当前状态+方向的步数、新方向和新状态的预计算值，每次迭代都会完成可变数量的步骤。假设一切都同样可能，每次迭代平均采取 2 步。作为奖励，它使用 1/4 的内存来模拟相同的区域：

while(--iteration)
{
        // roughly 31 seconds per 2^32 steps
        table_offset=(state[ant]*24)+(ant_orientation*3);
        it+=twoxtwo_table[table_offset+0];
        state[ant]=twoxtwo_table[table_offset+2];
        ant+=move_ant2x2[(ant_orientation=twoxtwo_table[table_offset+1])];
}

还没有尝试优化它，接下来要尝试的是消除偏移方程，并像以前一样使用嵌套开关和常量进行查找（但使用 648 个内部案例而不是 12 个）。

score 3 · Accepted Answer

或者，您可以使用单个无符号字节常量作为人工寄存器而不是分支：

value:   1  3  1  1
bits:   01 11 01 01 ---->101 decimal value for an unsigned byte
index    3  2  1  0  ---> get first 2 bits to get  "1"  (no shift)
                      --> get second 2 bits to get "1"  (shifting for 2 times)
                      --> get third 2 bits to get  "3"  (shifting for 4 times)
                      --> get last 2 bits to get   "1"  (shifting for 6 times)

 Then "AND" the result with  binary(11) or decimal(3) to get your value.

 (101>>( (++state[ant])*2  ) ) & 3 would give you the turnright or turnleft

 Example:
 ++state[ant]= 0:  ( 101>>( (0)*2 ) )&3  --> 101 & 3 = 1
 ++state[ant]= 1:  ( 101>>( (1)*2 ) )&3  --> 101>>2 & 3 = 1
 ++state[ant]= 2:  ( 101>>( (2)*2 ) )&3  --> 101>>4 & 3 = 3  -->turn left
 ++state[ant]= 3:  ( 101>>( (3)*2 ) )&3  --> 101>>6 & 3 = 1

 Maximum six-shifting + one-multiplication + one-"and" may be better.
 Dont forget constant can be auto-promoted so you may add some suffixes or something else.

由于您对 %4 模数使用“unsigned int”，因此可以使用“and”操作。

  state[ant]=state[ant]&3; instead of state[ant]=state[ant]%4;

对于不熟练的编译器，这应该会提高速度。

最难的部分：模3

  C = A % B is equivalent to C = A – B * (A / B)
  We need  state[ant]%3
  Result = state[ant] - 3 * (state[ant]/3)

  state[ant]/3 is always <=1 for your valid direction states.
  Only when state[ant]  is 3 then state[ant]/3 is 1, other values give 0.
  When multiplied by 3, that part is 0 or 3 (only 3 when state[ant] is 3 otherwise 0)
  Result = state[ant] - (0 or 3)

  Lets look at all possibilities:

  state[ant]=0:  0 - 0  ---> 0   ----> 00100100 shifted by 0 times &3 --> 00000000
  state[ant]=1:  1 - 0  ---> 1   ----> 00100100 shifted by 2 times &3 --> 00000001
  state[ant]=2:  2 - 0  ---> 2   ----> 00100100 shifted by 4 times &3 --> 00000010
  state[ant]=3:  3 - 3  ---> 0   ----> 00100100 shifted by 6 times &3 --> 00000000

  00100100 is 36 in decimal.

  (36 >> (state[ant] *2) ) & 3 will give you state[ant]%3 for your valid states (0,1,2,3)

  Example: 

  state[ant]=0: 36 >> 0  --> 36 ----> 36& 3 ----> 0  satisfies 0%3
  state[ant]=1: 36 >> 2  --> 9 -----> 9 & 3 ----> 1  satisfies 1%3
  state[ant]=2: 36 >> 4  --> 2 -----> 2 & 3 ----> 2 satisfies  2%3
  state[ant]=3: 36 >> 6  --> 0 -----> 0 & 3 ----> 0 satisfies 3%3

c - 如何优化这个 Langton's ant sim？

1 回答 1

Related

Reference