I've got a few questions regarding CUDA's scheduling system.
A. When I call, for example, foo<<<255, 255>>>(), what actually happens inside the card? I know that each SM receives a block to schedule from the upper level, and that each SM is responsible for scheduling its incoming block, but which part does this? If, for example, I have 8 SMs, each of which contains 8 small CPUs, is the upper level responsible for scheduling the remaining 255*255 - (8 * 8) threads? A sketch of the kind of launch I mean is below.
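To make question A concrete, here is a minimal sketch of the launch I have in mind. The kernel name foo and the data array are just placeholders I made up for illustration:

    #include <cuda_runtime.h>

    // Hypothetical kernel "foo", used only to illustrate the launch below.
    __global__ void foo(int *data)
    {
        // 255 blocks of 255 threads each; this computes a unique global
        // index for every thread in the grid (255 * 255 = 65025 total).
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        data[idx] = idx;
    }

    int main()
    {
        int *d_data;
        cudaMalloc(&d_data, 255 * 255 * sizeof(int));

        // The <<<255, 255>>> configuration asks for 255 blocks of 255
        // threads; as far as I understand, the blocks are handed out to
        // the SMs by the hardware scheduler, not by host code.
        foo<<<255, 255>>>(d_data);
        cudaDeviceSynchronize();

        cudaFree(d_data);
        return 0;
    }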
B. What is the maximum number of threads one can define? I mean, in foo<<<X, Y>>>(); what are the limits on X and Y?
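For question B, I assume the actual limits are per-device, so something like the following (using cudaGetDeviceProperties) should report them; is that the right way to find X and Y?

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        // Query the limits of device 0 at runtime.
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        printf("Max threads per block (Y): %d\n", prop.maxThreadsPerBlock);
        printf("Max grid size (X): %d x %d x %d\n",
               prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
        return 0;
    }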
C. Regarding the last example, how many threads can there be inside one block? Can we say that the more blocks/threads we have, the faster the execution will be?
Thanks for your help