Hi all, from the "NVIDIA CUDA Programming Guide 2.0", Section 5.1.2.1, "Coalescing on Devices with Compute Capability 1.2 and Higher":

"Find the memory segment that contains the address requested by the lowest numbered active thread. Segment size is 32 bytes for 8-bit data, 64 bytes for 16-bit data, 128 bytes for 32-, 64- and 128-bit data."

I have a doubt here: since each half-warp has 16 threads, if all threads access 8-bit data, then the total size per half-warp should be 16 × 8 bits = 128 bits = 16 bytes. Yet the guide says "32 bytes for 8-bit data". It seems that half of the bandwidth is wasted. Am I understanding this correctly?

Thanks, Deryk

1 Answer

Yes. Memory accesses are always performed in units of 32, 64, or 128 bytes, regardless of how much of the memory line you actually need.


Update:

Question: How does this explain the 64 bytes for 16-bit data?

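The values (32 bytes for 1-byte words, 64 bytes for 2-byte words, and 128 bytes for wider words) are the maximum size of the accessed segment. For example, if every thread fetches a 2-byte word and your accesses are perfectly aligned, the memory access is reduced to a single 32-byte line fetch.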
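To make the segment sizes and the reduction step concrete, here is a minimal host-side sketch in C++ that models how a half-warp's addresses might be grouped into transactions on compute capability 1.2/1.3 devices. It is an illustration, not NVIDIA code. The names `Transaction`, `maxSegmentSize`, `contiguous`, and `coalesce` are hypothetical. The halving rule (shrink the transaction while only the lower or upper half of the segment is used, down to 32 bytes) is my reading of the programming guide. The sketch also assumes every word is naturally aligned.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical model of one memory transaction (not an NVIDIA API).
struct Transaction {
    std::uint64_t base;  // start address of the transaction
    unsigned size;       // 32, 64, or 128 bytes
};

// Maximum segment size for a given word size, as quoted from the guide.
unsigned maxSegmentSize(unsigned wordBytes) {
    if (wordBytes == 1) return 32;
    if (wordBytes == 2) return 64;
    return 128;  // 4-, 8-, and 16-byte words
}

// Helper: addresses of `threads` consecutive words starting at `start`.
std::vector<std::uint64_t> contiguous(std::uint64_t start, unsigned wordBytes,
                                      unsigned threads = 16) {
    std::vector<std::uint64_t> addrs;
    for (unsigned t = 0; t < threads; ++t) addrs.push_back(start + t * wordBytes);
    return addrs;
}

// Assumed model of the coalescing loop for one half-warp.
std::vector<Transaction> coalesce(const std::vector<std::uint64_t>& addrs,
                                  unsigned wordBytes) {
    std::vector<Transaction> out;
    std::vector<bool> active(addrs.size(), true);
    for (std::size_t i = 0; i < addrs.size(); ++i) {
        if (!active[i]) continue;  // already serviced
        // Segment containing the lowest-numbered active thread's address.
        unsigned size = maxSegmentSize(wordBytes);
        std::uint64_t base = addrs[i] / size * size;
        std::uint64_t lo = addrs[i];                  // lowest byte used
        std::uint64_t hi = addrs[i] + wordBytes - 1;  // highest byte used
        // Service every active thread whose address lies in this segment.
        for (std::size_t j = i; j < addrs.size(); ++j) {
            if (active[j] && addrs[j] >= base && addrs[j] < base + size) {
                active[j] = false;
                lo = std::min(lo, addrs[j]);
                hi = std::max(hi, addrs[j] + wordBytes - 1);
            }
        }
        // Halve the transaction while only one half of it is used.
        while (size > 32) {
            unsigned half = size / 2;
            if (hi < base + half) {
                size = half;   // only the lower half is used
            } else if (lo >= base + half) {
                base += half;  // only the upper half is used
                size = half;
            } else {
                break;
            }
        }
        out.push_back({base, size});
    }
    return out;
}
```

Under this model, `coalesce(contiguous(0, 1), 1)` yields one 32-byte transaction for 16 bytes of useful data, which is the half-used case from the question. `coalesce(contiguous(0, 2), 2)` starts from a 64-byte segment and shrinks to a single 32-byte transaction, because 16 aligned 2-byte words fill exactly its lower half.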

See Section G.3.2.2, "Devices of Compute Capability 1.2 and 1.3", of the "CUDA Programming Guide (v3.2)".

I see that you are using CUDA PG v. 2.0 (and probably the CUDA 2.0 compiler as well). There have been many improvements since then (in particular, bug fixes).

Answered 2011-03-17T21:10:46.150