
I followed the instructions here to run Octave with NVBLAS. I have CUDA Toolkit 7.5 installed and a Tesla K40c GPU. To start Octave with NVBLAS I used LD_PRELOAD=libnvblas.so octave, and then ran the following simple code:

N = 256
A = rand(N,N)
B = rand(N,N)
A*B

which produces a matrix with reasonable values. However, if I increase N to 512, or any number over 512, I get back all zeros (or very small numbers) as the result.

If I use OpenBLAS this does not happen. The matrices should be small enough to fit comfortably in the card's 12 GB of RAM. Any idea why this might happen?
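For scale, a rough back-of-the-envelope check (a sketch, assuming N=512 and double-precision storage) confirms that memory is nowhere near the limit:

```shell
# Footprint of the three N x N double-precision matrices (A, B, and the product A*B).
N=512
bytes=$((3 * N * N * 8))
echo "$bytes bytes"   # 6291456 bytes, about 6 MiB -- far below the K40c's 12 GB
```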

Note: If I make A and B identity matrices this does not happen, but it still happens with A = B = ones(N,N).


1 Answer


Sorry, this question is a bit old, but I tried it on an Amazon AWS EC2 p2.xlarge instance with a K80 GPU, and it seems to work.

When I used the default "NVBLAS_GPU_LIST 0 1" setting in nvblas.conf, I got results similar to yours (lots of zeros). That setting appears to refer to two GPUs, so I changed it to list only one, and it worked. The full file:

#Put here the CPU BLAS fallback Library of your choice
NVBLAS_CPU_BLAS_LIB libopenblas.so

# Specify which output log file (default is stderr)
NVBLAS_LOGFILE nvblas.log

# List of GPU devices Id to participate to the computation
# By default if no GPU are listed, only device 0 will be used
NVBLAS_GPU_LIST 0
NVBLAS_AUTOPIN_MEM_ENABLED
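NVBLAS picks up this file via the NVBLAS_CONFIG_FILE environment variable (the paths and the t1.m filename below are examples, not from the original run); a sketch of writing the config and launching Octave with it:

```shell
# Write the single-GPU nvblas.conf shown above (example location: current directory).
cat > nvblas.conf <<'EOF'
NVBLAS_CPU_BLAS_LIB libopenblas.so
NVBLAS_LOGFILE nvblas.log
NVBLAS_GPU_LIST 0
NVBLAS_AUTOPIN_MEM_ENABLED
EOF

# Then launch Octave with NVBLAS preloaded and pointed at this config:
#   NVBLAS_CONFIG_FILE=$PWD/nvblas.conf LD_PRELOAD=libnvblas.so octave t1.m
```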

The program (t1.m), slightly modified from the NVIDIA link to count the number of nonzeros in the output matrix:

N = 16384;

# from the original NVidia example:
#A = single(rand(N,N));
#B = single(rand(N,N));

# double precision seems to work fine (not checked in detail)
A = rand(N,N);
B = rand(N,N);

start = clock();
C = A * B;
elapsedTime = etime(clock(), start);
disp(elapsedTime);
gFlops = 2*N*N*N/(elapsedTime * 1e+9);
disp(gFlops);

disp("number of elements >0:")
disp(sum(sum(C > 0)));

disp("Should be:")
disp(N*N)
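Since A and B are uniform random on (0,1), every entry of C should be strictly positive, so the count should equal N*N. A quick shell check of that expected value (a sketch, assuming N=16384 as in t1.m):

```shell
# Expected nonzero count: all N*N entries of C should be > 0 for uniform-random inputs.
N=16384
echo $((N * N))   # 268435456
```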

FYI, this is the output of nvidia-smi while the above was running (it peaked at 172 MiB of usage at N=16384):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.51                 Driver Version: 375.51                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:00:1E.0     Off |                    0 |
| N/A   44C    P0    80W / 149W |     80MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     21080    C   /usr/bin/octave-cli                             78MiB |
+-----------------------------------------------------------------------------+

These are the NVIDIA and CUDA files I had previously installed:

cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64-deb  
libcudnn5-dev_5.1.10-1+cuda8.0_amd64.deb
libcudnn5_5.1.10-1+cuda8.0_amd64.deb                   
nvidia-driver-local-repo-ubuntu1604_375.51-1_amd64.deb

I seem to get a speedup of about 8.6x: roughly 55 GFLOPS from plain Octave and roughly 478 from the GPU version.
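The GFLOPS figure follows from the 2*N^3 formula in t1.m; the elapsed time used below (~18.4 s) is back-derived from the reported ~478 GFLOPS, not taken from the original run:

```shell
# GFLOPS = 2*N^3 / (elapsed * 1e9), as computed in t1.m.
N=16384
ops=$((2 * N * N * N))
awk -v ops="$ops" -v t=18.4 'BEGIN { printf "%.0f GFLOPS\n", ops / (t * 1e9) }'
```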

Answered 2017-06-06T21:42:16.990