opengl - Strange performance behavior with an SSAO algorithm using OpenGL and GLSL

Question

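I'm working on the SSAO (Screen Space Ambient Occlusion) algorithm using the Oriented-Hemisphere rendering technique.

I) The algorithm

The algorithm requires as inputs:

1 array containing precomputed samples (loaded before the main loop; in my example I use 64 samples oriented along the z axis).
1 noise texture containing normalized rotation vectors, also oriented along the z axis (this texture is generated once).
2 textures from the GBuffer: "PositionSampler" and "NormalSampler", containing the positions and normal vectors in view space.

Here is the fragment shader source code I use: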

#version 400

/*
** Output color value.
*/
layout (location = 0) out vec4 FragColor;

/*
** Vertex inputs.
*/
in VertexData_VS
{
    vec2 TexCoords;

} VertexData_IN;

/*
** Projection matrix (view space to clip space).
*/
uniform mat4 ProjMatrix;

/*
** GBuffer samplers.
*/
uniform sampler2D PositionSampler;
uniform sampler2D NormalSampler;

/*
** Noise sampler.
*/
uniform sampler2D NoiseSampler;

/*
** Noise texture viewport.
*/
uniform vec2 NoiseTexOffset;

/*
** Ambient light intensity.
*/
uniform vec4 AmbientIntensity;

/*
** SSAO kernel + size.
*/
uniform vec3 SSAOKernel[64];
uniform uint SSAOKernelSize;
uniform float SSAORadius;

/*
** Computes Orientation matrix.
*/
mat3 GetOrientationMatrix(vec3 normal, vec3 rotation)
{
    vec3 tangent = normalize(rotation - normal * dot(rotation, normal)); //Gram-Schmidt process
    vec3 bitangent = cross(normal, tangent);

    return (mat3(tangent, bitangent, normal)); //Orientation according to the normal
}

/*
** Fragment shader entry point.
*/
void main(void)
{
    float OcclusionFactor = 0.0f;

    vec3 gNormal_CS = normalize(texture(
        NormalSampler, VertexData_IN.TexCoords).xyz * 2.0f - 1.0f); //Normal vector in view space from GBuffer
    vec3 rotationVec = normalize(texture(NoiseSampler,
        VertexData_IN.TexCoords * NoiseTexOffset).xyz * 2.0f - 1.0f); //Rotation vector required for the Gram-Schmidt process

    vec3 Origin_VS = texture(PositionSampler, VertexData_IN.TexCoords).xyz; //Origin vertex in view space from GBuffer
    mat3 OrientMatrix = GetOrientationMatrix(gNormal_CS, rotationVec);

    for (int idx = 0; idx < SSAOKernelSize; idx++) //For each sample (64 iterations)
    {
        vec4 Sample_VS = vec4(Origin_VS + OrientMatrix * SSAOKernel[idx], 1.0f); //Sample translated in view space

        vec4 Sample_HS = ProjMatrix * Sample_VS; //Sample in homogeneous clip space
        vec3 Sample_CS = Sample_HS.xyz /= Sample_HS.w; //Perspective divide (normalized device coordinates)
        vec2 texOffset = Sample_CS.xy * 0.5f + 0.5f; //Recover sample texture coordinates

        vec3 SampleDepth_VS = texture(PositionSampler, texOffset).xyz; //Sample depth in view space

        if (Sample_VS.z < SampleDepth_VS.z)
            if (length(Sample_VS.xyz - SampleDepth_VS) <= SSAORadius)
                OcclusionFactor += 1.0f; //Occlusion accumulation
    }
    OcclusionFactor = 1.0f - (OcclusionFactor / float(SSAOKernelSize));

    FragColor = vec4(OcclusionFactor);
    FragColor *= AmbientIntensity;
}

Here is the result (without a blur render pass):

Up to this point everything seems correct.

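II) The performance

I noticed a very strange performance behavior in the NSight Debugger:

The closer I move my camera to the dragon, the more drastically the performance drops.

To my mind this should not happen, because the SSAO algorithm works in screen space and does not depend on, for example, the number of primitives in the dragon.

Here are 3 screenshots with 3 different camera positions (in all 3 cases, every one of the 1024*768 pixels is shaded with the same algorithm):

a) GPU idle: 40% (pixels impacted: 100%)

b) GPU idle: 25% (pixels impacted: 100%)

c) GPU idle: 2%! (pixels impacted: 100%)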

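My rendering engine uses 2 render passes in my example:

the Material Pass (filling the position and normal samplers)
the Ambient Pass (filling the SSAO texture)

I thought the problem came from executing these two passes one after the other, but that is not the case: I added a condition in my client code so that the Material Pass is not recomputed while the camera is stationary. So when I took the 3 screenshots above, only the Ambient Pass was executed, and the performance loss is therefore not related to the Material Pass. Another argument: if I remove the dragon mesh (a scene with just the plane), the result is the same. The closer my camera is to the plane, the worse the performance!

To me this behavior is not logical! As I said above, in these 3 cases all the pixels are shaded by exactly the same fragment shader code!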

Now, I noticed another strange behavior when I change a small piece of code directly in the fragment shader:

If I replace the line:

FragColor = vec4(OcclusionFactor);

with the line:

FragColor = vec4(1.0f, 1.0f, 1.0f, 1.0f);

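the performance loss disappears!

This means that if the SSAO code runs correctly (I tried placing some breakpoints during execution to check it) but I do not use OcclusionFactor at the end to fill the final output color, there is no performance loss!

I think we can conclude that the problem does not come from the shader code before the line "FragColor = vec4(OcclusionFactor);"... I think.

How can you explain such behavior?

I tried a lot of code combinations, both in the client code and in the fragment shader code, but I can't find a solution to this problem! I'm really lost.

Thank you very much in advance for your help!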

score 5 · Accepted Answer

The short answer is cache efficiency.

To understand this, let's look at the following lines from the inner loop:

    vec4 Sample_VS = vec4(Origin_VS + OrientMatrix * SSAOKernel[idx], 1.0f); //Sample translated in view space

    vec4 Sample_HS = ProjMatrix * Sample_VS; //Sample in homogeneous clip space
    vec3 Sample_CS = Sample_HS.xyz /= Sample_HS.w; //Perspective divide (normalized device coordinates)
    vec2 texOffset = Sample_CS.xy * 0.5f + 0.5f; //Recover sample texture coordinates

    vec3 SampleDepth_VS = texture(PositionSampler, texOffset).xyz; //Sample depth in view space

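What you are doing here is:

Translating the origin point in view space
Transforming it to clip space
Sampling the texture

So how does this relate to cache efficiency?

Caches work well when neighboring pixels are accessed. For example, a Gaussian blur only reads neighbors, which have a high probability of already being loaded in the cache.

So suppose your object is very far away. The pixels sampled in clip space are then also very close to the origin point -> high locality -> good cache performance.

If the camera is very close to your object, the generated sample points are farther apart (in clip space) and you get a random memory access pattern. That drastically reduces your performance, even though you are not actually doing more operations.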
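
To put a rough number on the locality argument, here is a back-of-the-envelope estimate of how far from the origin pixel the texture fetches land on screen. It is only a sketch under stated assumptions: a symmetric perspective projection with vertical field of view θ, a viewport H pixels tall, a kernel offset of view-space length R perpendicular to the view direction, and a fragment at view-space distance d from the camera. The symbols θ, H, R and d are introduced here for illustration and do not appear in the question.

```latex
r_{\text{px}} \approx \frac{R}{d} \cdot \frac{H}{2\tan(\theta/2)}
```

With H = 768 (the viewport height from the question) and, purely as illustrative values, θ = 60° and R = 1, this gives about 33 pixels at d = 20 but about 333 pixels at d = 2. The fetches of a single fragment are then scattered over a screen area roughly 100 times larger, which is what defeats the texture cache.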

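Edit:

To improve performance, you could reconstruct the view-space position from the depth buffer of the previous pass.

If you use a 32-bit depth buffer, this reduces the amount of data needed per sample from 12 bytes (presumably three 32-bit floats per position) to 4 bytes.

The position reconstruction looks like this: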

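uniform sampler2D depthTexture; // depth buffer of the previous pass
uniform mat4 invProj;           // inverse projection matrix (pass this as a uniform)

vec4 reconstruct_vs_pos(vec2 tc){
  float depth = texture(depthTexture, tc).x;
  vec4 p = vec4(vec3(tc, depth) * 2.0f - 1.0f, 1.0f); // remapped from [0,1]^3 to the unit cube [-1,1]^3
  vec4 p_vs = invProj * p; // back to view space, still homogeneous
  return p_vs / p_vs.w;    // perspective divide
}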
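
As a hedged sketch of how the loop from the question might use this function, the fragment below replaces the fetch from PositionSampler with a fetch from the depth texture. The helper name ComputeOcclusion and the local name Scene_VS are hypothetical; the uniforms are the ones from the question, repeated so the fragment reads on its own, and the occlusion test is kept exactly as in the original shader.

```glsl
// Sketch only: SSAO loop reading 4-byte depth values instead of 12-byte positions.
// The uniforms below are already declared in the question's shader.
uniform mat4 ProjMatrix;      // projection matrix (view space to clip space)
uniform vec3 SSAOKernel[64];  // precomputed hemisphere samples
uniform uint SSAOKernelSize;
uniform float SSAORadius;

// Defined in the answer above; only the prototype is repeated here.
vec4 reconstruct_vs_pos(vec2 tc);

// Hypothetical helper: returns the ambient factor in [0,1] for one fragment.
float ComputeOcclusion(vec3 Origin_VS, mat3 OrientMatrix)
{
    float occlusion = 0.0;

    for (uint idx = 0u; idx < SSAOKernelSize; ++idx)
    {
        // Kernel sample moved to the fragment's position in view space.
        vec3 Sample_VS = Origin_VS + OrientMatrix * SSAOKernel[idx];

        // Project the sample and recover its texture coordinates.
        vec4 Sample_HS = ProjMatrix * vec4(Sample_VS, 1.0);
        vec2 texOffset = (Sample_HS.xy / Sample_HS.w) * 0.5 + 0.5;

        // Scene position at that pixel, rebuilt from the depth buffer.
        vec3 Scene_VS = reconstruct_vs_pos(texOffset).xyz;

        // Same occlusion test as in the question's shader.
        if (Sample_VS.z < Scene_VS.z &&
            length(Sample_VS - Scene_VS) <= SSAORadius)
            occlusion += 1.0;
    }
    return 1.0 - occlusion / float(SSAOKernelSize);
}
```

The access pattern is still scattered when the camera is close, so this reduces the cost of each cache miss rather than the number of misses; the extra matrix multiply per sample is the price paid for the smaller fetch.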

score 1 · Accepted Answer

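Another optimization you can make while you're at it is to render the SSAO texture at a reduced size, preferably half the size of your main viewport. If you do this, be sure to copy your depth texture to another half-size texture (glBlitFramebuffer) and sample your positions from that. I'd expect this to improve performance by an order of magnitude, especially in the worst-case scenario you've given.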
