I think you may simply need to learn more about CUDA. You may be falling into the trap of assuming that a programming paradigm you learned previously should be applied here; I'm not sure that's the case.
But to answer your question, first let me point out that thread synchronization in CUDA is only possible within a threadblock. So my comments only apply there.
The principal sync mechanism in device code is __syncthreads(). To use it roughly along the lines you describe, I could code something like this:
__syncthreads();
if (threadIdx.x < 100){
    // code in this block will only be executed by threads 0-99, all others do nothing
}
__syncthreads();
if ((threadIdx.x > 99) && (threadIdx.x < 200)){
    // code in this block will only be executed by threads 100-199, all others do nothing
}
// all threads can begin executing at this point
Note that even the threads in a threadblock are not all executing in lockstep. The SM (the threadblock processing unit in a CUDA GPU) generally breaks a threadblock into groups of 32 threads called warps, and it is these warps that actually execute (more or less) in lockstep. However, the code listed above still has the effect I describe, namely sequencing execution amongst groups of threads, if you wanted to do that for some reason. One related caveat: __syncthreads() must be reached by every thread in the block, which is why the barriers above sit outside the if statements rather than inside them.
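If it helps to see the pattern end to end, here is a minimal, self-contained sketch. The kernel name, block size, and the producer/consumer payload (the first group of threads fills a shared buffer, the second group reads it) are just assumptions for illustration, not anything taken from your code:

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel illustrating the sequencing pattern above:
// threads 0-99 produce values, threads 100-199 consume them.
__global__ void sequencedGroups(int *out)
{
    __shared__ int buf[100];

    __syncthreads();                      // all threads in the block line up here
    if (threadIdx.x < 100) {
        buf[threadIdx.x] = threadIdx.x * 2;   // "phase 1" work, threads 0-99 only
    }
    __syncthreads();                      // guarantees phase 1 is finished before phase 2 begins
    if ((threadIdx.x > 99) && (threadIdx.x < 200)) {
        // "phase 2" work, threads 100-199 only; safely reads what phase 1 wrote
        out[threadIdx.x - 100] = buf[threadIdx.x - 100] + 1;
    }
    // all threads continue from here
}

int main()
{
    int *d_out = nullptr;
    cudaMalloc(&d_out, 100 * sizeof(int));

    sequencedGroups<<<1, 256>>>(d_out);   // one block of 256 threads (need at least 200)
    cudaDeviceSynchronize();

    int h_out[100];
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("out[0] = %d, out[99] = %d\n", h_out[0], h_out[99]);  // expect 1 and 199

    cudaFree(d_out);
    return 0;
}

Because the barrier between the two if blocks is hit by every thread in the block, the second group cannot start reading buf until the first group has finished writing it; that is the whole point of placing the barriers outside the conditionals.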