cuda - CUDA: Is it possible to use all 48KB of on-chip memory as shared memory?

Question

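I am developing a CUDA application for a GTX 580 using CUDA Toolkit 4.0 and Visual Studio 2010 Professional on Windows 7 64-bit SP1. My program is more memory-intensive than typical CUDA programs, and I am trying to allocate as much shared memory as possible to each CUDA block. However, the program crashes every time I try to use more than 32K of shared memory per block.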

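From reading the official CUDA documentation, I learned that each SM on a CUDA device with compute capability 2.0 or greater has 48KB of on-chip memory, and that this on-chip memory is split between L1 cache and shared memory: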

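L1 and shared memory both use the same on-chip memory, and how much is dedicated to L1 versus shared memory is configurable for each kernel call (section F.4.1) http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/Fermi_Tuning_Guide.pdf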

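This leads me to suspect that only 32KB of the on-chip memory is allocated as shared memory when my program runs. Hence my question: is it possible to use all 48KB of the on-chip memory as shared memory?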

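I tried everything I could think of. I specified the option --ptxas-options="-v -dlcm=cg" for nvcc, and I called cudaDeviceSetCacheConfig() and cudaFuncSetCacheConfig() in my program, but none of these resolved the issue. I even made sure that there was no register spilling and that I was not accidentally using local memory: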

1>      24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1>  ptxas info    : Used 63 registers, 40000+0 bytes smem, 52 bytes cmem[0], 2540 bytes cmem[2], 8 bytes cmem[14], 72 bytes cmem[16]

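While I can live with 32KB of shared memory, which already gave me a huge performance boost, I would rather take full advantage of all the fast on-chip memory. Any help is much appreciated.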

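Update: I was launching 640 threads when the program crashed. 512 threads gave me better performance than 256, so I tried to increase the number of threads further.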

score 6 · Accepted Answer

Your problem is not related to the shared memory configuration but to the number of threads you are launching.

Using 63 registers per thread and launching 640 threads gives you a total of 40320 registers. Your device has 32K (32768) registers per SM, so you are running out of resources.
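The register arithmetic can be checked with a few lines of host code. This is a minimal sketch: the helper names `registers_per_block` and `max_threads_for_registers` are made up for illustration, and the calculation assumes a simple division that ignores the hardware's register allocation granularity, so the real limit may be somewhat lower.

```cpp
#include <cstdint>

// Registers one block needs: every thread holds its own copy of the
// kernel's registers. (Hypothetical helper, not part of the CUDA API.)
constexpr std::uint32_t registers_per_block(std::uint32_t regs_per_thread,
                                            std::uint32_t threads_per_block) {
    return regs_per_thread * threads_per_block;
}

// Largest thread count whose registers fit in one SM's register file.
// Assumption: plain integer division, ignoring per-warp allocation granularity.
constexpr std::uint32_t max_threads_for_registers(std::uint32_t regs_per_thread,
                                                  std::uint32_t regs_per_sm) {
    return regs_per_sm / regs_per_thread;
}

// 32K 32-bit registers per SM on compute capability 2.x (e.g. GTX 580).
constexpr std::uint32_t kRegsPerSm = 32 * 1024;

// 640 threads at 63 registers each exceed the register file ...
static_assert(registers_per_block(63, 640) > kRegsPerSm,
              "640 threads do not fit");
// ... while 512 threads still fit.
static_assert(registers_per_block(63, 512) <= kRegsPerSm,
              "512 threads fit");
```

Under these assumptions the bound works out to 520 threads, so 512 is the largest whole-warp block size that fits at 63 registers per thread. Lowering register usage (for example with nvcc's `--maxrregcount` option) would be one way to make room for more threads, at the risk of spilling.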

The on-chip memory is well explained in Tom's answer and, as he commented, checking the API calls for errors will help you catch problems like this in the future.

score 3 · Accepted Answer

Devices of compute capability 2.0 and higher have 64KB of on-chip memory per SM. This is configurable as 16KB L1 and 48KB smem, or 48KB L1 and 16KB smem (also 32/32 on compute capability 3.x).
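The limits that apply to a given card can be read back at run time rather than taken from documentation. The following is a small sketch using `cudaGetDeviceProperties`; querying device 0 is an assumption, and the values printed depend on the hardware and driver.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    // Assumption: the GPU of interest is device 0.
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    std::printf("%s (compute capability %d.%d)\n",
                prop.name, prop.major, prop.minor);
    // Maximum shared memory a single block may use.
    std::printf("shared memory per block: %lu bytes\n",
                (unsigned long)prop.sharedMemPerBlock);
    // Number of 32-bit registers available to a single block.
    std::printf("registers per block:     %d\n", prop.regsPerBlock);
    std::printf("max threads per block:   %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```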

Your program is crashing for another reason. Are you checking all API calls for errors? Have you tried cuda-memcheck?

If you use too much shared memory, you will get an error when you launch the kernel saying that there are insufficient resources.
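A sketch of what checking every call might look like is shown below. `CUDA_CHECK` and `myKernel` are hypothetical names introduced for illustration; the kernel is only a placeholder that declares 40000 bytes of static shared memory to mirror the ptxas report in the question. It is far simpler than the asker's kernel, so it will probably use fewer registers and may launch without error; the point is the checking pattern. When a launch does ask for more registers or shared memory than the SM provides, the check right after the launch is where the runtime is expected to report it (typically as "too many resources requested for launch").

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: stop with a readable message on any CUDA error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Placeholder kernel: 10000 floats = 40000 bytes of static shared memory.
__global__ void myKernel(float *out) {
    __shared__ float buf[10000];
    for (int i = threadIdx.x; i < 10000; i += blockDim.x) {
        buf[i] = static_cast<float>(i);
    }
    __syncthreads();
    if (threadIdx.x == 0) {
        out[blockIdx.x] = buf[9999];
    }
}

int main() {
    float *d_out = NULL;
    CUDA_CHECK(cudaMalloc(&d_out, sizeof(float)));

    // Request the larger shared memory split for this kernel.
    CUDA_CHECK(cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared));

    myKernel<<<1, 640>>>(d_out);
    // Launch-configuration problems (registers, shared memory, block size)
    // are reported here, immediately after the launch.
    CUDA_CHECK(cudaGetLastError());
    // Errors raised while the kernel runs are reported at the next
    // synchronizing call.
    CUDA_CHECK(cudaDeviceSynchronize());

    CUDA_CHECK(cudaFree(d_out));
    return 0;
}
```

Running the same binary under cuda-memcheck additionally reports out-of-bounds accesses inside the kernel, which return-code checks alone do not pinpoint.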

score -1 · Accepted Answer

Also, passing parameters from the host to the GPU uses the shared memory (up to 256 bytes) so you will never get the actual 48KB.
