gpgpu - Vulkan subgroupBarrier does not synchronize invocations

Question

I have a somewhat complex procedure that contains nested loops and a subgroupBarrier. In a simplified form it looks like

while(true){
   while(some_condition){
      if(end_condition){
          atomicAdd(some_variable,1);
          debugPrintfEXT("%d:%d",gl_SubgroupID,gl_SubgroupInvocationID.x);
          subgroupBarrier();
          if(gl_SubgroupInvocationID.x==0){
              debugPrintfEXT("Finish! %d", some_variable);
              // do some final stuff
          }
          return; // this is the only return in the entire procedure
      }
      // do some stuff
   }
   // do some stuff
}

Overall the procedure is correct and it does what's expected from it. All subgroup threads always eventually reach the end condition. However, in my logs I see

0:2
0:3
0:0
Finish! 3
0:1

And it's not just a matter of the logs being displayed out of order. The atomic addition gives the wrong result too. I need all threads to finish all their atomic operations before printing Finish!. If subgroupBarrier() worked the way I expect, it should print 4, but in my case it prints 3. I've been mostly following this tutorial https://www.khronos.org/blog/vulkan-subgroup-tutorial and it says that

void subgroupBarrier() performs a full memory and execution barrier - basically when an invocation returns from subgroupBarrier() we are guaranteed that every invocation executed the barrier before any return, and all memory writes by those invocations are visible to all invocations in the subgroup.

Interestingly I tried changing if(gl_SubgroupInvocationID.x==0) to other numbers. For example if(gl_SubgroupInvocationID.x==3) yields

0:2
0:3
Finish! 2
0:0
0:1

So it seems like the subgroupBarrier() is entirely ignored.

Could the nested loop be the cause of the problem or is it something else?

Edit:

I provide here more detailed code

#version 450
#extension GL_KHR_shader_subgroup_basic : enable
#extension GL_EXT_debug_printf : enable

layout (local_size_x_id = GROUP_SIZE_CONST_ID) in; // this is a specialization constant whose value always matches the subgroupSize

shared uint copied_faces_idx;

void main() {
    const uint chunk_offset = gl_WorkGroupID.x;
    const uint lID = gl_LocalInvocationID.x;
    // ... Some less important stuff happens here ...
    const uint[2] ending = uint[2](relocated_leading_faces_ending, relocated_trailing_faces_ending);
    const uint[2] beginning = uint[2](offset_to_relocated_leading_faces, offset_to_relocated_trailing_faces);
    uint part = 0;
    face_offset = lID;
    Face face_to_relocate = faces[face_offset];
    i=-1;
    debugPrintfEXT("Stop 1: %d %d",gl_SubgroupID,gl_SubgroupInvocationID.x);
    subgroupBarrier(); // I added this just to see what happens
    debugPrintfEXT("Stop 2: %d %d",gl_SubgroupID,gl_SubgroupInvocationID.x);
    while(true){
        while(face_offset >= ending[part]){
            part++;
            if(part>=2){
                debugPrintfEXT("Stop 3: %d %d",gl_SubgroupID,gl_SubgroupInvocationID.x);
                subgroupBarrier();
                debugPrintfEXT("Stop 4: %d %d",gl_SubgroupID,gl_SubgroupInvocationID.x);
                for(uint i=lID;i<inserted_face_count;i+=GROUP_SIZE){
                    uint offset = atomicAdd(copied_faces_idx,1);
                    face_to_relocate = faces_to_be_inserted[i];
                    debugPrintfEXT("Stop 5: %d %d",gl_SubgroupID,gl_SubgroupInvocationID.x);
                    tmp_faces_copy[offset+1] = face_to_relocate.x;
                    tmp_faces_copy[offset+2] = face_to_relocate.y;
                }
                subgroupBarrier(); // Let's make sure that copied_faces_idx has been incremented by all threads.
                if(lID==0){
                    debugPrintfEXT("Finish! %d",copied_faces_idx);
                    save_copied_face_count_to_buffer(copied_faces_idx);
                }
                return; 
            }
            face_offset = beginning[part] + lID;
            face_to_relocate = faces[face_offset];
        }
        i++;
        if(i==removed_face_count||shared_faces_to_be_removed[i]==face_to_relocate.x){
            remove_face(face_offset, i);
            debugPrintfEXT("remove_face: %d %d",gl_SubgroupID,gl_SubgroupInvocationID.x);
            face_offset+=GROUP_SIZE;
            face_to_relocate = faces[face_offset];
            i=-1;
        }
    }
}

Basically what this code does is equivalent to

outer1:for(every face X in polygon beginning){
   for(every face Y to be removed from polygons){
      if(X==Y){
         remove_face(X);
         continue outer1;
      }
   } 
}
outer2:for(every face X in polygon ending){
   for(every face Y to be removed from polygons){
      if(X==Y){
         remove_face(X);
         continue outer2;
      }
   } 
}
for(every face Z to be inserted in the middle of polygon){
   insertFace(Z);
}
save_copied_face_count_to_buffer(number_of_faces_copied_along_the_way);

The reason why my code looks so convoluted is because I wrote it in a way that is more parallelizable and tries to minimize the number of inactive threads (considering that usually threads in the same subgroup have to execute the same instruction).

I also added a bunch more debug prints and one more barrier just to see what happens. Here are the logs that I got

Stop 1: 0 0
Stop 1: 0 1
Stop 1: 0 2
Stop 1: 0 3
Stop 2: 0 0
Stop 2: 0 1
Stop 2: 0 2
Stop 2: 0 3
Stop 3: 0 2
Stop 3: 0 3
Stop 4: 0 2
Stop 4: 0 3
Stop 5: 0 2
Stop 5: 0 3
remove_face: 0 0
Stop 3: 0 0
Stop 4: 0 0
Stop 5: 0 0
Finish! 3   // at this point value 3 is saved (which is the wrong value)
remove_face: 0 1
Stop 3: 0 1
Stop 4: 0 1
Stop 5: 0 1 // at this point atomic is incremented and becomes 4 (which is the correct value)

score 1 · Accepted Answer

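I found the reason why my code did not work. It turns out that I misunderstood how exactly subgroupBarrier() decides which threads to synchronize. If a thread is inactive, it does not participate in the barrier. It does not matter whether the inactive thread later becomes active and eventually reaches the barrier.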
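
One way to check this on a given driver is to count the active invocations right before the barrier. The following is only a diagnostic sketch, not part of the original shader: the loop, the placeholder workgroup size of 32 and the per-invocation iteration count are made up for illustration, and it assumes GL_KHR_shader_subgroup_ballot is available. If the count printed is smaller than gl_SubgroupSize, the barrier is only synchronizing that smaller set.

```glsl
#version 450
#extension GL_KHR_shader_subgroup_basic : enable
#extension GL_KHR_shader_subgroup_ballot : enable
#extension GL_EXT_debug_printf : enable

layout (local_size_x = 32) in; // placeholder size, not taken from the question

void main() {
    // Hypothetical divergence: invocation n stays in the loop for n iterations,
    // so different invocations hit the end condition on different iterations.
    uint remaining = gl_SubgroupInvocationID;
    while (true) {
        if (remaining == 0) {
            // Every invocation that is active here contributes one bit to the ballot.
            uvec4 active_mask = subgroupBallot(true);
            uint active_count = subgroupBallotBitCount(active_mask);
            debugPrintfEXT("invocation %u: %u of %u active before the barrier",
                           gl_SubgroupInvocationID, active_count, gl_SubgroupSize);
            subgroupBarrier(); // can only synchronize the invocations counted above
            return;
        }
        remaining--;
    }
}
```

The exact counts depend on how the implementation schedules divergent invocations, so treat the output as a hint about the active set rather than a guaranteed result.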

These two loops are not equivalent (even though they look like they are):

while(true){
   if(end_condition){
      break;
   }
}
subgroupBarrier();
some_function();

and

while(true){
   if(end_condition){
      subgroupBarrier();
      some_function();
      return;
   }
}

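If all threads reach the end condition in exactly the same iteration, there is no problem, because all threads are active at the same time.

The issue appears when different threads may exit the loop in different iterations. If thread A passes the end condition after 2 iterations and thread B passes it after 3 iterations, there is one entire iteration between them during which A is inactive and waiting for B to finish.

In the first scenario, A reaches break first, then B reaches break, and finally both threads exit the loop and arrive at the barrier together.

In the second scenario, A reaches the end condition first and executes the body of the if statement while B is inactive, waiting for A to finish. When A reaches the barrier it is the only active thread at that point in time, so it passes through the barrier without synchronizing with B. A then finishes executing the body of the if statement, reaches return and becomes inactive. Only then does B become active again and finish executing its iteration. In the next iteration B reaches the end condition and the barrier, and again it is the only active thread, so the barrier does not have to synchronize anything.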
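
Applied to a shader shaped like the one in the question, the first scenario means leaving the loop with break and doing the atomic work and the barrier after the loop. The following is a minimal sketch under stated assumptions, not the original code: the buffers `Work` and `Result`, the names `iterations`, `finished_count` and `counter`, and the workgroup size of 32 are all hypothetical. It also uses barrier() instead of subgroupBarrier(); that is a deliberate swap, relying on the question's setup of one subgroup per workgroup, since after the loop every invocation of the workgroup reaches the same barrier() call.

```glsl
#version 450

layout (local_size_x = 32) in; // placeholder; the question uses a specialization constant

// Hypothetical buffers standing in for the question's data.
layout (std430, binding = 0) readonly buffer Work { uint iterations[]; };
layout (std430, binding = 1) writeonly buffer Result { uint finished_count; };

shared uint counter;

void main() {
    const uint lID = gl_LocalInvocationID.x;
    if (lID == 0) {
        counter = 0;
    }
    barrier(); // make the initial value visible before anyone increments it

    // Divergent part: each invocation may leave the loop on a different iteration.
    uint remaining = iterations[gl_GlobalInvocationID.x];
    while (true) {
        if (remaining == 0) {
            break; // only leave the loop here: no barrier and no return inside
        }
        remaining--;
    }

    // Every invocation reaches this point, so the barrier below sees all of them.
    atomicAdd(counter, 1);
    barrier();
    if (lID == 0) {
        finished_count = counter; // all increments have happened by now
    }
}
```

If the workgroup really is a single subgroup, subgroupBarrier() in place of the second barrier() matches the answer's first scenario more literally, but it then depends on the implementation reconverging the subgroup once the loop has been left.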
