Hi all. From the "NVIDIA CUDA Programming Guide 2.0", section "Coalescing on Devices with Compute Capability 1.2 and Higher":

"Find the memory segment that contains the address requested by the lowest numbered active thread. Segment size is 32 bytes for 8-bit data, 64 bytes for 16-bit data, 128 bytes for 32-, 64- and 128-bit data."

I have a question about this. Each half-warp has 16 threads, so if every thread accesses 8-bit data, the total per half-warp should be 16 * 8 bits = 128 bits = 16 bytes. Yet the guide says "32 bytes for 8-bit data". It seems that half of the bandwidth is wasted. Am I understanding this correctly?

Thanks, Deryk


1 Answer


Yes. Memory accesses are always performed in units of 32, 64, or 128 bytes, regardless of how much of the memory line you actually need.
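To make the arithmetic from the question concrete, here is a minimal sketch that compares the bytes a half-warp asks for with the size of the transaction that carries them. The function name `utilization` and the constant `HALF_WARP` are made up for this illustration; they are not part of CUDA.

```python
HALF_WARP = 16  # threads per half-warp on compute capability 1.x

def utilization(word_bytes, transaction_bytes):
    """Fraction of a transaction that carries data the half-warp requested.

    Hypothetical helper for illustration only; assumes every thread reads
    one distinct word and all words fall inside a single transaction.
    """
    requested = HALF_WARP * word_bytes
    return requested / transaction_bytes

# 8-bit data: 16 requested bytes travel in a 32-byte transaction.
print(utilization(word_bytes=1, transaction_bytes=32))  # 0.5
```

Under these assumptions, half of the transferred bytes are unused for 8-bit data, which matches the reading in the question.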


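Q: How does that explain the 64 bytes for 16-bit data?

A: The values (32 bytes for 1-byte words, 64 bytes for 2-byte words, and 128 bytes for larger words) are the maximum sizes of the accessed segments. If, for example, every thread fetches a 2-byte word and your access is perfectly aligned, the memory access is reduced to a single 32-byte line fetch.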
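The reduction described above can be sketched as a small host-side model. This is only an approximation of the protocol as I understand it from the Programming Guide (start from the segment of the lowest-numbered active thread, then halve the transaction while only its lower or upper half is used). It is not how the hardware is implemented. The function name `transactions` is hypothetical, and the model assumes every address is aligned to the word size.

```python
def transactions(addresses, word_bytes):
    """Model the memory transactions issued for one half-warp.

    addresses: byte address requested by each active thread, in thread order.
    Returns a list of (start_address, size_in_bytes) tuples.
    Assumption: each address is a multiple of word_bytes.
    """
    # Maximum segment size: 32 B for 1-byte words, 64 B for 2-byte, else 128 B.
    seg_size = {1: 32, 2: 64}.get(word_bytes, 128)
    pending = list(addresses)
    result = []
    while pending:
        # Segment containing the address of the lowest-numbered active thread.
        start = pending[0] // seg_size * seg_size
        served = [a for a in pending if start <= a < start + seg_size]
        size = seg_size
        # Halve the transaction while only one half of it is actually used.
        while size > 32:
            half = size // 2
            if all(a + word_bytes <= start + half for a in served):
                size = half                 # only the lower half is used
            elif all(a >= start + half for a in served):
                start += half               # only the upper half is used
                size = half
            else:
                break
        result.append((start, size))
        # Threads whose request was served become inactive.
        pending = [a for a in pending if a not in served]
    return result

# 16 threads reading consecutive, aligned 2-byte words:
print(transactions([2 * t for t in range(16)], word_bytes=2))  # [(0, 32)]
```

In this model, shifting the same 2-byte access pattern to start at byte 16 gives a single 64-byte transaction instead, because the requested words then straddle both halves of the segment.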


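I see you are using the CUDA Programming Guide v. 2.0 (and probably the CUDA 2.0 compiler). There have been a lot of improvements since then (in particular, bug fixes).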

answered 2011-03-17T21:10:46.150