I am using CuPy (7.0.0) and trying to get concurrent streams with a simple example script:
import cupy as cp

# creating streams
map_streams = []
for i in range(0, 100):
    map_streams.append(cp.cuda.stream.Stream(non_blocking=True))

asize = (1000, 100)

# creating arrays on the device
x = cp.ones(asize)
y = cp.ones(asize)
z = cp.ndarray(asize)

# do multiplications in the streams
for stream in map_streams:
    with stream:
        z = x * y
However, the multiplications are executed sequentially, even though each kernel is launched in its own stream:
==8339== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Device Context Stream Name
[...]
432.83ms 18.688us (782 1 1) (128 1 1) 14 0B 0B Tesla K80 (0) 1 42 cupy_multiply__float64_float64_float64 [376]
433.01ms 19.391us (782 1 1) (128 1 1) 14 0B 0B Tesla K80 (0) 1 43 cupy_multiply__float64_float64_float64 [381]
433.32ms 18.720us (782 1 1) (128 1 1) 14 0B 0B Tesla K80 (0) 1 44 cupy_multiply__float64_float64_float64 [386]
433.52ms 19.936us (782 1 1) (128 1 1) 14 0B 0B Tesla K80 (0) 1 45 cupy_multiply__float64_float64_float64 [391]
433.71ms 18.880us (782 1 1) (128 1 1) 14 0B 0B Tesla K80 (0) 1 46 cupy_multiply__float64_float64_float64 [396]
433.89ms 19.680us (782 1 1) (128 1 1) 14 0B 0B Tesla K80 (0) 1 47 cupy_multiply__float64_float64_float64 [401]
434.16ms 19.232us (782 1 1) (128 1 1) 14 0B 0B Tesla K80 (0) 1 48 cupy_multiply__float64_float64_float64 [406]
[...]
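To make "sequentially" concrete, the start times and durations in the trace above can be checked for overlap. This is only a rough sketch in plain Python: the variable names are my own, and the numbers are copied from the seven rows shown, not from the full trace.

```python
# Start times (ms) and durations (us) copied from the seven trace rows above.
starts_ms = [432.83, 433.01, 433.32, 433.52, 433.71, 433.89, 434.16]
durations_us = [18.688, 19.391, 18.720, 19.936, 18.880, 19.680, 19.232]

# Idle gap between the end of one kernel and the start of the next, in us.
# A negative gap would mean the two kernels overlapped in time.
gaps_us = []
for i in range(len(starts_ms) - 1):
    end_us = starts_ms[i] * 1000.0 + durations_us[i]
    next_start_us = starts_ms[i + 1] * 1000.0
    gaps_us.append(next_start_us - end_us)

overlapping = [g for g in gaps_us if g < 0]
print("gaps between consecutive kernels (us):", [round(g, 1) for g in gaps_us])
print("overlapping pairs:", len(overlapping))
```

For these rows every gap is positive and falls roughly between 160 and 290 us, while each kernel runs for only about 19 us. So the kernels never overlap, and the time between launches is several times longer than the kernels themselves.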
Can anyone tell me what is wrong with my script?
Update:
Even when I increase the workload per stream, the streams are still processed sequentially.
asize = (1000, 200)
x = cp.random.rand(asize[0], asize[1])
y = cp.random.rand(asize[0], asize[1])
z = cp.ndarray(asize)

for stream in map_streams:
    with stream:
        z = cp.fft.fft2(x*y)
The result is as follows:
[...]
1.8e+10s 10.784us (391 1 1) (128 1 1) 12 0B 0B Tesla K80 (0) 1 100 cupy_copy__float64_complex128 [5444]
1.8e+10s 20.384us (50 1 1) (8 5 5) 72 7.8125KB 0B Tesla K80 (0) 1 100 void dpRadix0125C::kernel1Mem<unsigned int, double, fftDirection_t=-1, unsigned int=8, unsigned int=2, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, double>) [5491]
1.8e+10s 10.464us (49 1 1) (128 1 1) 72 0B 0B Tesla K80 (0) 1 100 void dpRadix0008A::kernel1Mem<unsigned int, double, fftDirection_t=-1, unsigned int=128, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, double>) [5494]
1.8e+10s 29.055us (63 1 1) (10 16 1) 92 0B 6.2500KB Tesla K80 (0) 1 100 void composite_2way_fft<unsigned int=50, unsigned int=1, unsigned int=5, unsigned int=16, unsigned int=1, unsigned int=0, unsigned int=2, unsigned int=10, unsigned int=1, unsigned int=0, unsigned int, double>(kernel_arguments_t<unsigned int>) [5496]
1.8e+10s 10.176us (391 1 1) (128 1 1) 12 0B 0B Tesla K80 (0) 1 101 cupy_copy__float64_complex128 [5502]
1.8e+10s 20.896us (50 1 1) (8 5 5) 72 7.8125KB 0B Tesla K80 (0) 1 101 void dpRadix0125C::kernel1Mem<unsigned int, double, fftDirection_t=-1, unsigned int=8, unsigned int=2, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, double>) [5549]
1.8e+10s 10.592us (49 1 1) (128 1 1) 72 0B 0B Tesla K80 (0) 1 101 void dpRadix0008A::kernel1Mem<unsigned int, double, fftDirection_t=-1, unsigned int=128, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, double>) [5552]
1.8e+10s 28.831us (63 1 1) (10 16 1) 92 0B 6.2500KB Tesla K80 (0) 1 101 void composite_2way_fft<unsigned int=50, unsigned int=1, unsigned int=5, unsigned int=16, unsigned int=1, unsigned int=0, unsigned int=2, unsigned int=10, unsigned int=1, unsigned int=0, unsigned int, double>(kernel_arguments_t<unsigned int>) [5554]
1.8e+10s 10.431us (391 1 1) (128 1 1) 12 0B 0B Tesla K80 (0) 1 102 cupy_copy__float64_complex128 [5560]
1.8e+10s 20.959us (50 1 1) (8 5 5) 72 7.8125KB 0B Tesla K80 (0) 1 102 void dpRadix0125C::kernel1Mem<unsigned int, double, fftDirection_t=-1, unsigned int=8, unsigned int=2, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, double>) [5607]
1.8e+10s 10.720us (49 1 1) (128 1 1) 72 0B 0B Tesla K80 (0) 1 102 void dpRadix0008A::kernel1Mem<unsigned int, double, fftDirection_t=-1, unsigned int=128, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, double>) [5610]
1.8e+10s 28.640us (63 1 1) (10 16 1) 92 0B 6.2500KB Tesla K80 (0) 1 102 void composite_2way_fft<unsigned int=50, unsigned int=1, unsigned int=5, unsigned int=16, unsigned int=1, unsigned int=0, unsigned int=2, unsigned int=10, unsigned int=1, unsigned int=0, unsigned int, double>(kernel_arguments_t<unsigned int>) [5612]
[...]