Hi all,
From the "NVIDIA CUDA Programming Guide 2.0", Section 5.1.2.1, "Coalescing on Devices with Compute Capability 1.2 and Higher":
"Find the memory segment that contains the address requested by the lowest numbered active thread. Segment size is 32 bytes for 8-bit data, 64 bytes for 16-bit data, 128 bytes for 32-, 64- and 128-bit data."
I have a question about this. Each half-warp has 16 threads, so if every thread accesses 8-bit data, the total requested by the half-warp is 16 * 8 bits = 128 bits = 16 bytes. But the Guide says the segment size is "32 bytes for 8-bit data". That seems to mean half of the bandwidth is wasted. Am I understanding this correctly?
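To make the arithmetic concrete, here is a small Python sketch of the comparison I am making. It only uses the segment sizes from the quoted sentence and assumes each of the 16 threads reads one contiguous, aligned element; it does not model any other step of the coalescing protocol. The names `SEGMENT_BYTES`, `requested_bytes` and `request_to_segment_ratio` are my own, not from the Guide.

```python
# Segment sizes in bytes, keyed by element width in bits,
# copied from the sentence quoted from the Guide.
SEGMENT_BYTES = {8: 32, 16: 64, 32: 128, 64: 128, 128: 128}

HALF_WARP_THREADS = 16  # threads per half-warp


def requested_bytes(bits):
    """Bytes requested by one half-warp, assuming one element per thread."""
    return HALF_WARP_THREADS * bits // 8


def request_to_segment_ratio(bits):
    """Requested bytes divided by the quoted segment size.

    Below 1.0: the request fills only part of one segment.
    Above 1.0: the request does not fit in a single segment.
    """
    return requested_bytes(bits) / SEGMENT_BYTES[bits]


if __name__ == "__main__":
    for bits in sorted(SEGMENT_BYTES):
        print(bits, "bit:",
              requested_bytes(bits), "bytes requested,",
              SEGMENT_BYTES[bits], "byte segment, ratio",
              request_to_segment_ratio(bits))
```

Under these assumptions the 8-bit case comes out at 16 bytes requested against a 32-byte segment, which is the 50% figure I am asking about.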
Thanks,
Deryk