python - 可以使用 numba.guvectorize 来模拟并行 forall / prange？

Question

作为使用 Python 进行数据分析和数值计算的用户，而不是真正的“编码器”，我一直错过了一种在多个内核上分发令人尴尬的并行循环计算的非常低开销的方法。据我了解，Numba 中曾经有 prange 构造，但由于“不稳定和性能问题”而被放弃。

在使用新开源的@guvectorize 装饰器时，我找到了一种使用它来几乎无开销地模拟后期 prange 功能的方法。

感谢 Continuum Analytics 的工作人员，我很高兴现在手头有这个工具，并且在网络上没有找到任何明确提到 @guvectorize 的使用的东西。尽管对于之前一直在使用 NumbaPro 的人来说这可能是微不足道的，但我将其发布给所有那些非编码人员（请参阅我对这个“问题”的回答）。

score -1 · Accepted Answer

Consider the example below, where a two-level nested for loop with a core doing some numerical calculation involving two input arrays and a function of the loop indices is executed in four different ways. Each variant is timed with Ipython's %timeit magic:

naive for loop, compiled using numba.jit
forall-like construct using numba.guvectorize, executed in a single thread (target = "cpu")
forall-like construct using numba.guvectorize, executed in as many threads as there are cpu "cores" (in my case hyperthreads) (target = "parallel")
same as 3., however calling the "guvectorized" forall with the sequence of "parallel" loop indices randomly permuted

The last one is done because (in this particular example) the inner loop's range depends on the value of the outer loop's index. I don't know how exactly the dispatchment of gufunc calls is organized inside numpy, but it appears as if the randomization of "parallel" loop indices achieves slightly better load balancing.

On my (slow) machine (1st gen core i5, 2 cores, 4 hyperthreads) I get the timings:

1 loop, best of 3: 8.19 s per loop
1 loop, best of 3: 8.27 s per loop
1 loop, best of 3: 4.6 s per loop
1 loop, best of 3: 3.46 s per loop

Note: I'd be interested if this recipe readily applies to target="gpu" (it should do, but I don't have access to a suitable graphics card right now), and what's the speedup. Please post!

And here's the example:

import numpy as np
from numba import jit, guvectorize, float64, int64

@jit
def naive_for_loop(some_input_array, another_input_array, result):
    for i in range(result.shape[0]):
        for k in range(some_input_array.shape[0] - i):
            result[i] += some_input_array[k+i] * another_input_array[k] * np.sin(0.001 * (k+i)) 

@guvectorize([(float64[:],float64[:],int64[:],float64[:])],'(n),(n),()->()', nopython=True, target='parallel')
def forall_loop_body_parallel(some_input_array, another_input_array, loop_index, result):
    i = loop_index[0]       # just a shorthand
    # do some nontrivial calculation involving elements from the input arrays and the loop index
    for k in range(some_input_array.shape[0] - i):
        result[0] += some_input_array[k+i] * another_input_array[k] * np.sin(0.001 * (k+i)) 

@guvectorize([(float64[:],float64[:],int64[:],float64[:])],'(n),(n),()->()', nopython=True, target='cpu')
def forall_loop_body_cpu(some_input_array, another_input_array, loop_index, result):
    i = loop_index[0]       # just a shorthand
    # do some nontrivial calculation involving elements from the input arrays and the loop index
    for k in range(some_input_array.shape[0] - i):
        result[0] += some_input_array[k+i] * another_input_array[k] * np.sin(0.001 * (k+i)) 

arg_size = 20000

input_array_1 = np.random.rand(arg_size)
input_array_2 = np.random.rand(arg_size)
result_array = np.zeros_like(input_array_1)

# do single-threaded naive nested for loop
# reset result_array inside %timeit call 
%timeit -r 3 result_array[:] = 0.0; naive_for_loop(input_array_1, input_array_2, result_array)
result_1 = result_array.copy()

# do single-threaded forall loop (loop indices in-order)
# reset result_array inside %timeit call 
loop_indices = range(arg_size)
%timeit -r 3 result_array[:] = 0.0; forall_loop_body_cpu(input_array_1, input_array_2, loop_indices, result_array)
result_2 = result_array.copy()

# do multi-threaded forall loop (loop indices in-order)
# reset result_array inside %timeit call 
loop_indices = range(arg_size)
%timeit -r 3 result_array[:] = 0.0; forall_loop_body_parallel(input_array_1, input_array_2, loop_indices, result_array)
result_3 = result_array.copy()

# do forall loop (loop indices scrambled for better load balancing)
# reset result_array inside %timeit call 
loop_indices_scrambled = np.random.permutation(range(arg_size))
loop_indices_unscrambled = np.argsort(loop_indices_scrambled)
%timeit -r 3 result_array[:] = 0.0; forall_loop_body_parallel(input_array_1, input_array_2, loop_indices_scrambled, result_array)
result_4 = result_array[loop_indices_unscrambled].copy()


# check validity
print(np.all(result_1 == result_2))
print(np.all(result_1 == result_3))
print(np.all(result_1 == result_4))

python - 可以使用 numba.guvectorize 来模拟并行 forall / prange？

1 回答 1

Related

Reference