memory - CUDA: memory transaction size for compute capability 1.2 or later

Question

Hi all. From the "NVIDIA CUDA Programming Guide 2.0", Section 5.1.2.1, "Coalescing on Devices with Compute Capability 1.2 and Higher":

"Find the memory segment that contains the address requested by the lowest numbered active thread. Segment size is 32 bytes for 8-bit data, 64 bytes for 16-bit data, 128 bytes for 32-, 64- and 128-bit data."

I have a doubt here: since each half-warp has 16 threads, if all threads access 8-bit data, the total size per half-warp should be 16 × 8 bits = 128 bits = 16 bytes. But the Guide says "32 bytes for 8-bit data", so it seems that half of the bandwidth is wasted. Am I understanding this correctly?

Thanks, Deryk

score 2 · Accepted Answer

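Yes. Memory accesses are always performed in units of 32, 64, or 128 bytes, regardless of how much of the fetched memory line you actually need.

Update:

Question: How does this explain the 64 bytes for 16-bit data?

The values 32 bytes for 1-byte words, 64 bytes for 2-byte words, and 128 bytes for larger words are the maximum sizes of the accessed segment. For example, if each thread fetches a 2-byte word and your accesses are perfectly aligned, the memory access is reduced to a single 32-byte line fetch.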
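To make the reduction concrete, here is a minimal host-side sketch in plain C of how the first transaction for a half-warp could be sized. It is not NVIDIA's implementation: the function names (`max_segment_bytes`, `first_transaction_bytes`, `fill_strided`) are hypothetical, and it assumes all 16 threads are active, addresses are naturally aligned to the word size, and only the first transaction (the one serving the lowest-numbered thread) is modeled.

```c
#include <assert.h>
#include <stdint.h>

#define HALF_WARP 16

/* Maximum segment size in bytes for a given word size, as quoted from the guide. */
unsigned max_segment_bytes(unsigned word_bytes)
{
    assert(word_bytes == 1 || word_bytes == 2 || word_bytes == 4 ||
           word_bytes == 8 || word_bytes == 16);
    if (word_bytes == 1) return 32;
    if (word_bytes == 2) return 64;
    return 128;
}

/* Hypothetical helper: thread t requests address start + t * stride. */
void fill_strided(uint64_t addr[HALF_WARP], uint64_t start, unsigned stride)
{
    for (int t = 0; t < HALF_WARP; ++t)
        addr[t] = start + (uint64_t)t * stride;
}

/* Size in bytes of the transaction that serves thread 0 (assumed to be the
 * lowest-numbered active thread) and every other thread in the same segment. */
unsigned first_transaction_bytes(const uint64_t addr[HALF_WARP], unsigned word_bytes)
{
    unsigned seg = max_segment_bytes(word_bytes);
    uint64_t base = addr[0] / seg * seg;  /* segment containing thread 0's address */
    uint64_t lo = UINT64_MAX, hi = 0;     /* first and last byte used in the segment */

    for (int t = 0; t < HALF_WARP; ++t) {
        if (addr[t] >= base && addr[t] < base + seg) {
            if (addr[t] < lo) lo = addr[t];
            if (addr[t] + word_bytes - 1 > hi) hi = addr[t] + word_bytes - 1;
        }
    }

    /* Halve the transaction while only its lower or upper half is used;
     * 32 bytes is the smallest transaction. */
    while (seg > 32) {
        uint64_t mid = base + seg / 2;
        if (hi < mid) {
            seg /= 2;                     /* only the lower half is used */
        } else if (lo >= mid) {
            base = mid;                   /* only the upper half is used */
            seg /= 2;
        } else {
            break;                        /* both halves are touched */
        }
    }
    return seg;
}
```

Under these assumptions, 16 consecutive 8-bit reads starting at address 0 yield a 32-byte transaction of which only 16 bytes are used, which matches the 50% waste described in the question. Consecutive, aligned 16-bit reads start from a 64-byte segment and shrink to 32 bytes, while the same reads starting at byte 16 straddle both halves and stay at 64 bytes.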

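See Section G.3.2.2, "Devices of Compute Capability 1.2 and 1.3", of the "CUDA Programming Guide" (v3.2).

I see that you are using CUDA Programming Guide v2.0 (and possibly the CUDA 2.0 compiler as well). There have been many improvements since then (in particular, bug fixes).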
