In this article, Guido van Rossum says that function calls can be expensive, but I don't understand why, nor how expensive they actually are.
How much latency does a simple function call add to your code, and why?
A function call requires the current execution frame to be suspended, and a new frame to be created and pushed onto the stack. Compared with many other operations, that is relatively expensive.
You can measure the exact time it takes with the timeit module:
>>> import timeit
>>> def f(): pass
...
>>> timeit.timeit(f)
0.15175890922546387
That is one million calls to an empty function in about a sixth of a second. Compare that with the time taken by whatever you are thinking of putting into the function: if performance is a concern, 0.15 seconds is the figure you need to weigh it against.
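If you want to see how much of that 0.15 s is pure call overhead rather than useful work, a rough follow-up comparison is the sketch below (absolute numbers depend on your machine and interpreter version):

import timeit

def f():
    pass

# one million calls to an empty function
print(timeit.timeit(f, number=1000000))

# one million loop iterations doing the equivalent "work" inline
print(timeit.timeit("pass", number=1000000))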
Python has a "relatively high" function-call overhead; it is the price we pay for some of Python's most useful features.
Monkey patching:
You have so much power to monkey-patch and override behaviour in Python that the interpreter cannot guarantee that, given
a, b = X(1), X(2)
return a.fn() + b.fn() + a.fn()
a.fn() and b.fn() are the same thing, or that a.fn() will still be the same thing after b.fn() has been called.
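A minimal sketch of why that guarantee is impossible (X here stands for any ordinary class; the method bodies are made up for illustration):

class X(object):
    def __init__(self, n):
        self.n = n
    def fn(self):
        return self.n

a = X(1)
print(a.fn())                      # 1

# anything run in between -- b.fn(), another thread, an import -- may do this:
X.fn = lambda self: self.n * 100   # monkey-patch the class
print(a.fn())                      # 100: same expression, different result

Because that is always possible, the compiled bytecode has to look fn up afresh at every call site: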
In [1]: def f(a, b):
...: return a.fn() + b.fn() + c.fn()
...:
In [2]: dis.dis(f)
1 0 LOAD_FAST 0 (a)
3 LOAD_ATTR 0 (fn)
6 CALL_FUNCTION 0
9 LOAD_FAST 1 (b)
12 LOAD_ATTR 0 (fn)
15 CALL_FUNCTION 0
18 BINARY_ADD
19 LOAD_GLOBAL 1 (c)
22 LOAD_ATTR 0 (fn)
25 CALL_FUNCTION 0
28 BINARY_ADD
29 RETURN_VALUE
Above, you can see that "fn" is looked up at every location. The same applies to variables, although people seem more aware of that case:
In [11]: def g(a):
...: return a.i + a.i + a.i
...:
In [12]: dis.dis(g)
2 0 LOAD_FAST 0 (a)
3 LOAD_ATTR 0 (i)
6 LOAD_FAST 0 (a)
9 LOAD_ATTR 0 (i)
12 BINARY_ADD
13 LOAD_FAST 0 (a)
16 LOAD_ATTR 0 (i)
19 BINARY_ADD
20 RETURN_VALUE
Worse, because modules can monkey-patch or replace themselves and each other, if you are calling a global or module-level function, the global/module has to be looked up every time:
In [16]: def h():
...:     v = numpy.vector(numpy.vector.identity)
...:     for i in range(100):
...:         v = numpy.vector.add(v, numpy.vector.identity)
...:
In [17]: dis.dis(h)
2 0 LOAD_GLOBAL 0 (numpy)
3 LOAD_ATTR 1 (vector)
6 LOAD_GLOBAL 0 (numpy)
9 LOAD_ATTR 1 (vector)
12 LOAD_ATTR 2 (identity)
15 CALL_FUNCTION 1
18 STORE_FAST 0 (v)
3 21 SETUP_LOOP 47 (to 71)
24 LOAD_GLOBAL 3 (range)
27 LOAD_CONST 1 (100)
30 CALL_FUNCTION 1
33 GET_ITER
>> 34 FOR_ITER 33 (to 70)
37 STORE_FAST 1 (i)
4 40 LOAD_GLOBAL 0 (numpy)
43 LOAD_ATTR 1 (vector)
46 LOAD_ATTR 4 (add)
49 LOAD_FAST 0 (v)
52 LOAD_GLOBAL 0 (numpy)
55 LOAD_ATTR 1 (vector)
58 LOAD_ATTR 2 (identity)
61 CALL_FUNCTION 2
64 STORE_FAST 0 (v)
67 JUMP_ABSOLUTE 34
>> 70 POP_BLOCK
>> 71 LOAD_CONST 0 (None)
74 RETURN_VALUE
Workarounds
Consider capturing or importing any value that you expect will not change:
import os

def f1(files):
    for filename in files:
        if os.path.exists(filename):
            yield filename
# vs
def f2(files):
    from os.path import exists
    for filename in files:
        if exists(filename):
            yield filename
# or
def f3(files, exists=os.path.exists):
    for filename in files:
        if exists(filename):
            yield filename
See also the "In the wild" section.
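Whether the capture actually pays off depends on how much real work the function does around the lookup; a rough micro-benchmark along these lines can tell you (the file names are made up, and the absolute numbers will vary):

import os
import timeit

names = ["/tmp/does-not-exist-%d" % i for i in range(100)]

def f1(files):
    for filename in files:
        if os.path.exists(filename):
            yield filename

def f3(files, exists=os.path.exists):
    for filename in files:
        if exists(filename):
            yield filename

print(timeit.timeit(lambda: list(f1(names)), number=1000))
print(timeit.timeit(lambda: list(f3(names)), number=1000))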
However, importing is not always an option: you can import sys.stdin, for example, but you cannot import sys.stdin.readline, and numpy types can have similar problems:
In [15]: def h():
...:     from numpy import vector
...:     add = vector.add
...:     idy = vector.identity
...:     v = vector(idy)
...:     for i in range(100):
...:         v = add(v, idy)
...:
In [16]: dis.dis(h)
2 0 LOAD_CONST 1 (-1)
3 LOAD_CONST 2 (('vector',))
6 IMPORT_NAME 0 (numpy)
9 IMPORT_FROM 1 (vector)
12 STORE_FAST 0 (vector)
15 POP_TOP
3 16 LOAD_FAST 0 (vector)
19 LOAD_ATTR 2 (add)
22 STORE_FAST 1 (add)
4 25 LOAD_FAST 0 (vector)
28 LOAD_ATTR 3 (identity)
31 STORE_FAST 2 (idy)
5 34 LOAD_FAST 0 (vector)
37 LOAD_FAST 2 (idy)
40 CALL_FUNCTION 1
43 STORE_FAST 3 (v)
6 46 SETUP_LOOP 35 (to 84)
49 LOAD_GLOBAL 4 (range)
52 LOAD_CONST 3 (100)
55 CALL_FUNCTION 1
58 GET_ITER
>> 59 FOR_ITER 21 (to 83)
62 STORE_FAST 4 (i)
7 65 LOAD_FAST 1 (add)
68 LOAD_FAST 3 (v)
71 LOAD_FAST 2 (idy)
74 CALL_FUNCTION 2
77 STORE_FAST 3 (v)
80 JUMP_ABSOLUTE 59
>> 83 POP_BLOCK
>> 84 LOAD_CONST 0 (None)
87 RETURN_VALUE
CAVEAT EMPTOR:
- capturing variables is not a zero-cost operation, it increases the frame size,
- only do this after you have identified a hot code path.
Argument passing
Python's argument-passing mechanism looks trivial, but unlike in most languages it costs quite a lot. We are talking here about how arguments get split into args and kwargs:
f(1, 2, 3)
f(1, 2, c=3)
f(c=3)
f(1, 2) # c is auto-injected
A lot of work goes on inside the CALL_FUNCTION operation, including potential transitions from the C layer to the Python layer and back.
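A rough way to see that splitting arguments is not free is to time the different call shapes directly (relative numbers only; they vary by interpreter version):

import timeit

setup = "def f(a, b, c=3): pass"

print(timeit.timeit("f(1, 2, 3)", setup=setup))    # all positional
print(timeit.timeit("f(1, 2, c=3)", setup=setup))  # keyword argument
print(timeit.timeit("f(1, 2)", setup=setup))       # default fills in c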
Beyond that, the arguments being passed often need lookups of their own:
f(obj.x, obj.y, obj.z)
Consider:
In [28]: def fn(obj):
...:     f = some.module.function
...:     for x in range(1000):
...:         for y in range(1000):
...:             f(x + obj.x, y + obj.y, obj.z)
...:
In [29]: dis.dis(fn)
2 0 LOAD_GLOBAL 0 (some)
3 LOAD_ATTR 1 (module)
6 LOAD_ATTR 2 (function)
9 STORE_FAST 1 (f)
3 12 SETUP_LOOP 76 (to 91)
15 LOAD_GLOBAL 3 (range)
18 LOAD_CONST 1 (1000)
21 CALL_FUNCTION 1
24 GET_ITER
>> 25 FOR_ITER 62 (to 90)
28 STORE_FAST 2 (x)
4 31 SETUP_LOOP 53 (to 87)
34 LOAD_GLOBAL 3 (range)
37 LOAD_CONST 1 (1000)
40 CALL_FUNCTION 1
43 GET_ITER
>> 44 FOR_ITER 39 (to 86)
47 STORE_FAST 3 (y)
5 50 LOAD_FAST 1 (f)
53 LOAD_FAST 2 (x)
56 LOAD_FAST 0 (obj)
59 LOAD_ATTR 4 (x)
62 BINARY_ADD
63 LOAD_FAST 3 (y)
66 LOAD_FAST 0 (obj)
69 LOAD_ATTR 5 (y)
72 BINARY_ADD
73 LOAD_FAST 0 (obj)
76 LOAD_ATTR 6 (z)
79 CALL_FUNCTION 3
82 POP_TOP
83 JUMP_ABSOLUTE 44
>> 86 POP_BLOCK
>> 87 JUMP_ABSOLUTE 25
>> 90 POP_BLOCK
>> 91 LOAD_CONST 0 (None)
94 RETURN_VALUE
其中“LOAD_GLOBAL”要求对名称进行哈希处理,然后在全局表中查询该哈希值。这是一个 O(log N) 操作。
And consider: for our two simple 0-1000 loops, we are doing that a million times...
LOAD_FAST and LOAD_ATTR are lookups too, just against specific places: LOAD_FAST indexes directly into the frame's local-variable array (which is why it is "fast"), while LOAD_ATTR does a hash lookup on whatever object was last loaded...
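You can get a feel for what the repeated LOAD_ATTR costs by timing an attribute access performed on every iteration against one hoisted into a plain name bound once; a sketch (relative numbers only):

import timeit

# math.sqrt resolved via an attribute lookup on every iteration
print(timeit.timeit("math.sqrt(2.0)", setup="import math"))

# the same function bound to a bare name once, up front
print(timeit.timeit("sqrt(2.0)", setup="from math import sqrt"))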
Also note that in that listing we are calling a function a million times. Granted, it is a builtin, and builtins have far lower overhead; but if this really were your performance hotspot, you might want to optimize away the overhead of range with something like:
x, y = 0, 0
for i in range(1000 * 1000):
    ...
    y += 1
    if y >= 1000:
        x, y = x + 1, 0
You could do some hacks with captured variables, but they would likely make minimal difference to the performance of this code and would only make it less maintainable.
The core pythonic fix for this problem, though, is to use generators or iterables:
for i in obj.values():
    prepare(i)
# vs
prepare(obj.values())
和
for i in ("left", "right", "up", "down"):
    test_move(i)
# vs
test_move(("left", "right", "up", "down"))
和
for x in range(-1000, 1000):
    for y in range(-1000, 1000):
        fn(x + obj.x, y + obj.y, obj.z)
# vs
def coordinates(obj):
    for x in range(obj.x - 1000, obj.x + 1000):
        for y in range(obj.y - 1000, obj.y + 1000):
            yield x, y, obj.z
fn(coordinates(obj))
In the wild
You will see these pythonicisms in the wild in forms such as:
import sys

def some_fn(a, b, c, stdin=sys.stdin):
    ...
This has a couple of advantages: sys.stdin is looked up once, when the def statement runs, rather than on every call, and callers can substitute a different file-like object, which is handy for testing.
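For example, a made-up function just to show the pattern:

import sys
import io

def read_name(stdin=sys.stdin):
    # sys.stdin was looked up once, when the def statement ran
    return stdin.readline().strip()

# normal use reads from the real sys.stdin;
# tests can inject a fake stream instead:
assert read_name(stdin=io.StringIO("alice\n")) == "alice"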
Most numpy calls either take, or have variants that take, lists, arrays, and so on; if you are not using those, you are probably missing out on 99% of numpy's benefit.
import numpy

def distances(target, candidates):
    values = []
    for candidate in candidates:
        values.append(numpy.linalg.norm(candidate - target))
    return numpy.array(values)
# vs
def distances(target, candidates):
    # assumes candidates is an (N, dim) array; axis=1 gives one norm per row
    return numpy.linalg.norm(candidates - target, axis=1)
(Note: this is not necessarily the best way to get distances, especially if you are not going to pass the distance values on somewhere else; if you are doing a range check, for example, a more selective approach that avoids the sqrt operation can be more efficient.)
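For instance, if all you actually need is "which candidates lie within radius r of the target", you can compare squared distances and never take a square root; a sketch, again assuming candidates is an (N, dim) array:

import numpy

def within_range(target, candidates, radius):
    # squared Euclidean distance from each candidate to the target
    sq_dist = numpy.sum((candidates - target) ** 2, axis=1)
    # compare against radius**2 rather than sqrt-ing every distance
    return candidates[sq_dist <= radius ** 2]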
Optimizing for iterables means not only passing them in, but also returning them:
def f4(files, exists=os.path.exists):
    return (filename for filename in files if exists(filename))
           # ^- returns a generator expression
Any statement of the form "X is expensive" doesn't take into account that performance is always relative to whatever else is going on, and relative to however else the task can be done.
There are many questions on SO that express concern about things that might be, but typically are not, performance problems.
As to whether function calls are expensive, there's a general two-part answer.
For functions that do very little and do not call further sub-functions, and that in a particular application are responsible for more than 10% of total wall-clock time, it is worthwhile trying to in-line them or otherwise reduce the cost of invocation.
In applications containing complex data structures and/or tall abstraction hierarchies, function calls are expensive not because of the time they take, but because they tempt you to make more of them than strictly necessary. When this occurs over multiple levels of abstraction, the inefficiencies multiply together, producing a compounded slowdown that is not easily localized.
The way to produce efficient code is a posteriori, not a priori. First write the code so it is clean and maintainable, including function calls as you like. Then while it is running with a realistic workload, let it tell you what can be done to speed it up. Here's an example.
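One generic way to "let it tell you", sketched with the standard-library profiler (main() below is a placeholder for whatever realistic entry point and workload your program has):

import cProfile
import pstats

# run a realistic workload under the profiler and save the stats
# (assumes a main() entry point is defined in the __main__ module)
cProfile.run("main()", "profile.out")

# show the 20 functions with the highest cumulative time
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(20)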