I have a somewhat complex procedure that contains nested loops and a subgroupBarrier(). In a simplified form it looks like this:
while(true){
    while(some_condition){
        if(end_condition){
            atomicAdd(some_variable,1);
            debugPrintfEXT("%d:%d",gl_SubgroupID,gl_SubgroupInvocationID.x);
            subgroupBarrier();
            if(gl_SubgroupInvocationID.x==0){
                debugPrintfEXT("Finish! %d", some_variable);
                // do some final stuff
            }
            return; // this is the only return in the entire procedure
        }
        // do some stuff
    }
    // do some stuff
}
Overall the procedure is correct and does what's expected of it. All subgroup threads always eventually reach the end condition. However, in my logs I see:
0:2
0:3
0:0
Finish! 3
0:1
And it's not just a matter of the logs being displayed out of order. I perform an atomic addition and its result seems to be wrong too. I need all threads to finish all their atomic operations before "Finish!" is printed. If subgroupBarrier() worked the way I expect, it should print 4, but in my case it prints 3. I've been mostly following this tutorial
https://www.khronos.org/blog/vulkan-subgroup-tutorial
and it says that
void subgroupBarrier() performs a full memory and execution barrier - basically when an invocation returns from subgroupBarrier() we are guaranteed that every invocation executed the barrier before any return, and all memory writes by those invocations are visible to all invocations in the subgroup.
Interestingly, I tried changing the condition if(gl_SubgroupInvocationID.x==0) to check other invocation IDs. For example, if(gl_SubgroupInvocationID.x==3) yields
0:2
0:3
Finish! 2
0:0
0:1
So it seems like subgroupBarrier() is entirely ignored.
Could the nested loop be the cause of the problem or is it something else?
Edit:
Here is more detailed code:
#version 450
#extension GL_KHR_shader_subgroup_basic : enable
#extension GL_EXT_debug_printf : enable
layout (local_size_x_id = GROUP_SIZE_CONST_ID) in; // this is a specialization constant whose value always matches the subgroupSize
shared uint copied_faces_idx;
void main() {
    const uint chunk_offset = gl_WorkGroupID.x;
    const uint lID = gl_LocalInvocationID.x;
    // ... Some less important stuff happens here ...
    const uint[2] ending = uint[2](relocated_leading_faces_ending, relocated_trailing_faces_ending);
    const uint[2] beginning = uint[2](offset_to_relocated_leading_faces, offset_to_relocated_trailing_faces);
    uint part = 0;
    face_offset = lID;
    Face face_to_relocate = faces[face_offset];
    i=-1;
    debugPrintfEXT("Stop 1: %d %d",gl_SubgroupID,gl_SubgroupInvocationID.x);
    subgroupBarrier(); // I added this just to test and see what happens
    debugPrintfEXT("Stop 2: %d %d",gl_SubgroupID,gl_SubgroupInvocationID.x);
    while(true){
        while(face_offset >= ending[part]){
            part++;
            if(part>=2){
                debugPrintfEXT("Stop 3: %d %d",gl_SubgroupID,gl_SubgroupInvocationID.x);
                subgroupBarrier();
                debugPrintfEXT("Stop 4: %d %d",gl_SubgroupID,gl_SubgroupInvocationID.x);
                for(uint i=lID;i<inserted_face_count;i+=GROUP_SIZE){
                    uint offset = atomicAdd(copied_faces_idx,1);
                    face_to_relocate = faces_to_be_inserted[i];
                    debugPrintfEXT("Stop 5: %d %d",gl_SubgroupID,gl_SubgroupInvocationID.x);
                    tmp_faces_copy[offset+1] = face_to_relocate.x;
                    tmp_faces_copy[offset+2] = face_to_relocate.y;
                }
                subgroupBarrier(); // Let's make sure that copied_faces_idx has been incremented by all threads.
                if(lID==0){
                    debugPrintfEXT("Finish! %d",copied_faces_idx);
                    save_copied_face_count_to_buffer(copied_faces_idx);
                }
                return;
            }
            face_offset = beginning[part] + lID;
            face_to_relocate = faces[face_offset];
        }
        i++;
        if(i==removed_face_count||shared_faces_to_be_removed[i]==face_to_relocate.x){
            remove_face(face_offset, i);
            debugPrintfEXT("remove_face: %d %d",gl_SubgroupID,gl_SubgroupInvocationID.x);
            face_offset+=GROUP_SIZE;
            face_to_relocate = faces[face_offset];
            i=-1;
        }
    }
}
Basically what this code does is equivalent to
outer1:for(every face X in polygon beginning){
    for(every face Y to be removed from polygons){
        if(X==Y){
            remove_face(X);
            continue outer1;
        }
    }
}
outer2:for(every face X in polygon ending){
    for(every face Y to be removed from polygons){
        if(X==Y){
            remove_face(X);
            continue outer2;
        }
    }
}
for(every face Z to be inserted in the middle of polygon){
    insertFace(Z);
}
save_copied_face_count_to_buffer(number_of_faces_copied_along_the_way);
My code looks so convoluted because I wrote it in a way that is more parallelizable and tries to minimize the number of inactive threads (given that threads in the same subgroup usually have to execute the same instruction).
I also added several more debug prints and one more barrier just to see what happens. Here are the logs that I got:
Stop 1: 0 0
Stop 1: 0 1
Stop 1: 0 2
Stop 1: 0 3
Stop 2: 0 0
Stop 2: 0 1
Stop 2: 0 2
Stop 2: 0 3
Stop 3: 0 2
Stop 3: 0 3
Stop 4: 0 2
Stop 4: 0 3
Stop 5: 0 2
Stop 5: 0 3
remove_face: 0 0
Stop 3: 0 0
Stop 4: 0 0
Stop 5: 0 0
Finish! 3 // at this point value 3 is saved (which is the wrong value)
remove_face: 0 1
Stop 3: 0 1
Stop 4: 0 1
Stop 5: 0 1 // at this point atomic is incremented and becomes 4 (which is the correct value)
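If the cause is that subgroupBarrier() only synchronizes the invocations that are active when it executes (the GL_KHR_shader_subgroup extension describes the barrier in terms of active invocations), then invocations that reach the return in different loop iterations would never wait for each other, which would match the logs above. Below is a minimal sketch of one possible restructuring under that assumption, not a verified fix. Each invocation records completion in a flag, and nobody leaves the loop until subgroupAll() reports that all of them are done, so the final barrier runs with the whole subgroup active. The names counter, copied_count, remaining and done are hypothetical stand-ins that do not exist in my real shader. The sketch also assumes that the workgroup size equals the subgroup size and that invocations reconverge after the inner if.

```glsl
#version 450
#extension GL_KHR_shader_subgroup_basic : enable
#extension GL_KHR_shader_subgroup_vote : enable

// Assumption: 4 matches the subgroup size, as in the logs above.
layout (local_size_x = 4) in;

// Hypothetical output buffer standing in for save_copied_face_count_to_buffer().
layout (std430, binding = 0) buffer Result { uint copied_count; };

// Hypothetical stand-in for copied_faces_idx.
shared uint counter;

void main() {
    const uint lID = gl_LocalInvocationID.x;

    // Initialize the shared counter once and make the write visible to the workgroup.
    if (lID == 0) {
        counter = 0u;
    }
    memoryBarrierShared();
    barrier();

    // Hypothetical per-invocation workload, chosen so that invocations
    // finish their work in different loop iterations.
    uint remaining = lID + 1u;
    bool done = false;

    // No invocation leaves the loop until every invocation has set its flag,
    // so the loop exit is uniform across the subgroup.
    while (!subgroupAll(done)) {
        if (!done) {
            remaining--;
            if (remaining == 0u) {
                atomicAdd(counter, 1u); // the "final" work of this invocation
                done = true;
            }
        }
    }

    // All invocations arrive here together, after all atomics were issued.
    subgroupBarrier();
    if (lID == 0) {
        copied_count = counter;
    }
}
```

The trade-off is that invocations which finish early keep spinning through the loop with their flag set instead of returning, which costs some idle iterations but keeps the barrier out of divergent control flow.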