python - 为什么用指数更快地在 python 中取模数？

Question

我正在尝试优化我正在修改的程序，当我注意到 dovalue = i % 65536似乎比 do 运行得更慢时value = i % (2**16)。

为了测试这一点，我运行了以下程序：

import cProfile
import pstats

AMOUNT = 100000000

def test1():
    for i in xrange(AMOUNT):
        value = i % 65536
    return

def test2():
    for i in xrange(AMOUNT):
        value = i % (256**2)
    return

def test3():
    for i in xrange(AMOUNT):
        value = i % (16**4)
    return

def test4():
    for i in xrange(AMOUNT):
        value = i % (4**8)
    return

def test5():
    for i in xrange(AMOUNT):
        value = i % (2**16)
    return

def run_tests():
    test1()
    test2()
    test3()
    test4()
    test5()
    return

if __name__ == '__main__':
    cProfile.run('run_tests()', 'results')
    stats = pstats.Stats('results')
    stats.sort_stats('calls', 'nfl')
    stats.print_stats()

...产生以下输出：

Fri May 11 15:11:59 2012    results

         8 function calls in 40.473 seconds

   Ordered by: call count, name/file/line

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000   40.473   40.473 <string>:1(<module>)
        1    0.000    0.000   40.473   40.473 test.py:31(run_tests)
        1   10.466   10.466   10.466   10.466 test.py:6(test1)
        1    7.475    7.475    7.475    7.475 test.py:11(test2)
        1    7.485    7.485    7.485    7.485 test.py:16(test3)
        1    7.539    7.539    7.539    7.539 test.py:21(test4)
        1    7.508    7.508    7.508    7.508 test.py:26(test5)

using65536最慢，为 10.466 秒，而 doing256**2最快，为 7.475 秒（其他可能的指数值介于两者之间）。诚然，这种速度差异只有在大量重复的情况下才会明显，但我仍然很好奇为什么会发生这种情况。

为什么取模数65536比使用指数取模要慢？它们应该评估为相同的数字，我原以为 python 解释器在使用 mod 之前需要更长的时间来完全评估指数。

通过扩展，在python表达式中使用2的幂而不是完全输入数字通常更有效吗？这种模式是否适用于除模数以外的运算或除以外的其他数字2吗？

（顺便说一句，我使用的是 Python 2.7.2（32 位），并且我在 64 位 Windows 7 笔记本电脑上运行了上述程序）。

编辑：
所以我尝试颠倒我调用的函数的顺序，现在正好相反。看起来run_tests在使用 cProfile 时，无论第一个函数是什么，运行速度总是会慢一些，这很奇怪。所以，吸取教训，我猜——分析器很奇怪：D

score 19 · Accepted Answer

生成的字节码没有区别，因为编译器很好地完成了它的工作并优化了常量算术表达式。这意味着您的测试结果只是一个巧合（尝试以不同的顺序计时功能！）。

>>> import dis
>>> dis.dis(test1)
  2           0 SETUP_LOOP              30 (to 33)
              3 LOAD_GLOBAL              0 (xrange)
              6 LOAD_GLOBAL              1 (AMOUNT)
              9 CALL_FUNCTION            1
             12 GET_ITER            
        >>   13 FOR_ITER                16 (to 32)
             16 STORE_FAST               0 (i)

  3          19 LOAD_FAST                0 (i)
             22 LOAD_CONST               1 (65536)
             25 BINARY_MODULO       
             26 STORE_FAST               1 (value)
             29 JUMP_ABSOLUTE           13
        >>   32 POP_BLOCK           

  4     >>   33 LOAD_CONST               0 (None)
             36 RETURN_VALUE        
>>> dis.dis(test5)
  2           0 SETUP_LOOP              30 (to 33)
              3 LOAD_GLOBAL              0 (xrange)
              6 LOAD_GLOBAL              1 (AMOUNT)
              9 CALL_FUNCTION            1
             12 GET_ITER            
        >>   13 FOR_ITER                16 (to 32)
             16 STORE_FAST               0 (i)

  3          19 LOAD_FAST                0 (i)
             22 LOAD_CONST               3 (65536)
             25 BINARY_MODULO       
             26 STORE_FAST               1 (value)
             29 JUMP_ABSOLUTE           13
        >>   32 POP_BLOCK           

  4     >>   33 LOAD_CONST               0 (None)
             36 RETURN_VALUE

（实际上存在差异：数字存储在常量表中的不同偏移量处。不过，我无法想象这会导致任何差异）。

为了完整起见，这是使用该timeit模块的正确测试：

import timeit

setup = "i = 1337"

best1 = best2 = float("inf")
for _ in range(5000):
  best1 = min(best1, timeit.timeit("i % 65536", setup=setup, number=10000))
for _ in range(5000):
  best2 = min(best2, timeit.timeit("i % (2**16)", setup=setup, number=10000))
print best1
print best2

请注意，我正在测量所需的最短时间，而不是平均值。如果由于某种原因需要更长的时间，这只是意味着它被更频繁地中断（因为代码不依赖于任何东西，只依赖于你的 CPU 的能力）。

score 6 · Accepted Answer

嗯，使用dis显示 python 字节代码表明功能是相同的。Python 已经优化了常量（如预期的那样）。所以我怀疑时差是缓存效应。我的笔记本电脑上的时间证明了这一点（在 Linux 上使用 Python 2.7.3 64 位）

Fri May 11 23:37:49 2012    results

     8 function calls in 38.825 seconds

Ordered by: call count, name/file/line

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
     1    0.000    0.000   38.825   38.825 <string>:1(<module>)
     1    0.000    0.000   38.825   38.825 z.py:31(run_tests)
     1    7.880    7.880    7.880    7.880 z.py:6(test1)
     1    7.658    7.658    7.658    7.658 z.py:11(test2)
     1    7.806    7.806    7.806    7.806 z.py:16(test3)
     1    7.784    7.784    7.784    7.784 z.py:21(test4)
     1    7.697    7.697    7.697    7.697 z.py:26(test5)

几乎完全相同

>>> from dis import dis
>>> def test1():
...     for i in xrange(AMOUNT):
...         value = i % 65536
...     return
... 
>>> def test5():
...     for i in xrange(AMOUNT):
...         value = i % (2**16)
...     return
... 
>>> dis(test1)
  2           0 SETUP_LOOP              30 (to 33)
              3 LOAD_GLOBAL              0 (xrange)
              6 LOAD_GLOBAL              1 (AMOUNT)
              9 CALL_FUNCTION            1
             12 GET_ITER            
        >>   13 FOR_ITER                16 (to 32)
             16 STORE_FAST               0 (i)

  3          19 LOAD_FAST                0 (i)
             22 LOAD_CONST               1 (65536)
             25 BINARY_MODULO       
             26 STORE_FAST               1 (value)
             29 JUMP_ABSOLUTE           13
        >>   32 POP_BLOCK           

  4     >>   33 LOAD_CONST               0 (None)
             36 RETURN_VALUE        
>>> dis(test5)
  2           0 SETUP_LOOP              30 (to 33)
              3 LOAD_GLOBAL              0 (xrange)
              6 LOAD_GLOBAL              1 (AMOUNT)
              9 CALL_FUNCTION            1
             12 GET_ITER            
        >>   13 FOR_ITER                16 (to 32)
             16 STORE_FAST               0 (i)

  3          19 LOAD_FAST                0 (i)
             22 LOAD_CONST               3 (65536)
             25 BINARY_MODULO       
             26 STORE_FAST               1 (value)
             29 JUMP_ABSOLUTE           13
        >>   32 POP_BLOCK           

  4     >>   33 LOAD_CONST               0 (None)
             36 RETURN_VALUE        
>>>

score 4 · Accepted Answer

您只运行了每个测试一次。你的 CPU 的速度并不总是一样的，在测试开始时它很可能处于休眠状态，这就是第一次测试较慢的原因。要对一小部分代码（如 mod）进行基准测试，请使用 timeit 模块：

>>> timeit.timeit('for i in range(10000): i % 65536', number=1000)
0.8686108589172363
>>> timeit.timeit('for i in range(10000): i % 256**2', number=1000)
0.862062931060791
>>> timeit.timeit('for i in range(10000): i % 4**8', number=1000)
0.8644928932189941
>>> timeit.timeit('for i in range(10000): i % 2**16', number=1000)
0.8643178939819336
>>> timeit.timeit('for i in range(10000): i % 65536', number=1000)
0.8640358448028564

您可以看到平均值始终在 0.864 左右。

python - 为什么用指数更快地在 python 中取模数？

3 回答 3

Related

Reference