Section 6.4.2 of the PTX ISA manual, version 2.3 (http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/ptx_isa_2.3.pdf), states:

Array elements can be accessed using an explicitly calculated byte address, or by indexing into the array using square-bracket notation. The expression within square brackets is either a constant integer, a register variable, or a simple “register with constant offset” expression, where the offset is a constant expression that is either added or subtracted from a register variable. If more complicated indexing is desired, it must be written as an address calculation prior to use.

 ld.global.u32  s, a[0]; 
 ld.global.u32  s, a[N-1]; 
 mov.u32  s, a[1];  // move address of a[1] into s 

When I try this, only the pointer-plus-byte-offset form works, i.e. [a+0].

This code fails to load:

.reg    .f32 f<1>;
.global .f32 a[10];
ld.global.f32 f0,a[0];

Whereas this loads fine:

.reg    .f32 f<1>;
.global .f32 a[10];
ld.global.f32 f0,[a+0];

The problem with the byte-offset version is that it really is a byte offset: one has to take the size of the underlying type into account, so the second element is [a+4], whereas a[1] is supposed to work this out for you.

Any ideas what's going wrong?

EDIT

There is an even more severe issue here: the text quoted above states that a register variable can be used to index the array, like:

ld.global.f32 f0,a[u0];

where u0 is presumably a .reg .u32 or some other compatible integer register.

However, this is not possible with the pointer-plus-byte-offset method. It is illegal to do something like:

mul.u32 u1,u0,4;
ld.global.f32 f0,[a+u1]; // here a reg variable is not allowed.

This is a severe limitation. One can, however, do a separate address calculation prior to the load statement, but this complicates things.
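
For reference, that separate address calculation might look roughly like the sketch below. It assumes 64-bit addressing (hence the .u64 address registers) and that u0 already holds the element index; the register names rd0 to rd2 are made up for illustration, and the fragment is untested.

```ptx
.reg    .u32 u<1>;          // u0: element index (assumed to be set elsewhere)
.reg    .u64 rd<3>;         // hypothetical address registers, 64-bit addressing assumed
.reg    .f32 f<1>;
.global .f32 a[10];

mov.u64       rd0, a;        // rd0 = address of a
mul.wide.u32  rd1, u0, 4;    // rd1 = byte offset = index * 4 (size of .f32)
add.u64       rd2, rd0, rd1; // rd2 = address of a[u0]
ld.global.f32 f0, [rd2];     // load a[u0] through the computed address
```

On a 32-bit target the same idea would presumably use .u32 registers throughout, with mul.lo.u32 in place of mul.wide.u32.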

1 Answer

This does not seem to fit with the PTX documentation you quoted, but you can add in a multiplier corresponding to the size of the items in your array. For instance, to load the 32-bit word at index 10:

ld.const.u32 my_u32, [my_ptr + 10 * 4];
answered 2012-11-09T15:06:59.067