performance - GLSL performance - function return value/type

Question

I'm using bicubic filtering to smooth my heightmap, and I implemented it in GLSL:

Bicubic interpolation (see the interpolate() function below):

float interpolateBicubic(sampler2D tex, vec2 t)
{
    // Unit offsets in texel units
    vec2 offBot =   vec2(0,-1);
    vec2 offTop =   vec2(0,1);
    vec2 offRight = vec2(1,0);
    vec2 offLeft =  vec2(-1,0);

    // Fractional position inside the current texel (texture size hard-coded to 1025)
    vec2 f = fract(t.xy * 1025);

    // 4x4 grid of sample coordinates: bottom row (y-1) ...
    vec2 bot0 = (floor(t.xy * 1025)+offBot+offLeft)/1025;
    vec2 bot1 = (floor(t.xy * 1025)+offBot)/1025;
    vec2 bot2 = (floor(t.xy * 1025)+offBot+offRight)/1025;
    vec2 bot3 = (floor(t.xy * 1025)+offBot+2*offRight)/1025;

    // ... lower-middle row (y) ...
    vec2 mbot0 = (floor(t.xy * 1025)+offLeft)/1025;
    vec2 mbot1 = (floor(t.xy * 1025))/1025;
    vec2 mbot2 = (floor(t.xy * 1025)+offRight)/1025;
    vec2 mbot3 = (floor(t.xy * 1025)+2*offRight)/1025;

    // ... upper-middle row (y+1) ...
    vec2 mtop0 = (floor(t.xy * 1025)+offTop+offLeft)/1025;
    vec2 mtop1 = (floor(t.xy * 1025)+offTop)/1025;
    vec2 mtop2 = (floor(t.xy * 1025)+offTop+offRight)/1025;
    vec2 mtop3 = (floor(t.xy * 1025)+offTop+2*offRight)/1025;

    // ... and top row (y+2)
    vec2 top0 = (floor(t.xy * 1025)+2*offTop+offLeft)/1025;
    vec2 top1 = (floor(t.xy * 1025)+2*offTop)/1025;
    vec2 top2 = (floor(t.xy * 1025)+2*offTop+offRight)/1025;
    vec2 top3 = (floor(t.xy * 1025)+2*offTop+2*offRight)/1025;

    // Fetch the 16 heights
    float h[16];

    h[0] = texture(tex,bot0).r;
    h[1] = texture(tex,bot1).r;
    h[2] = texture(tex,bot2).r;
    h[3] = texture(tex,bot3).r;

    h[4] = texture(tex,mbot0).r;
    h[5] = texture(tex,mbot1).r;
    h[6] = texture(tex,mbot2).r;
    h[7] = texture(tex,mbot3).r;

    h[8] = texture(tex,mtop0).r;
    h[9] = texture(tex,mtop1).r;
    h[10] = texture(tex,mtop2).r;
    h[11] = texture(tex,mtop3).r;

    h[12] = texture(tex,top0).r;
    h[13] = texture(tex,top1).r;
    h[14] = texture(tex,top2).r;
    h[15] = texture(tex,top3).r;

    // Interpolate each row along x ...
    float H_ix[4];

    H_ix[0] = interpolate(f.x,h[0],h[1],h[2],h[3]);
    H_ix[1] = interpolate(f.x,h[4],h[5],h[6],h[7]);
    H_ix[2] = interpolate(f.x,h[8],h[9],h[10],h[11]);
    H_ix[3] = interpolate(f.x,h[12],h[13],h[14],h[15]);

    // ... then interpolate the four row results along y
    float H_iy = interpolate(f.y,H_ix[0],H_ix[1],H_ix[2],H_ix[3]);

    return H_iy;
}

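This is my version; the texture size (1025) is still hard-coded. Using it in the vertex shader and/or the tessellation evaluation shader hurts performance badly (20-30 fps). But when I change the last line of this function to: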

return 0;

the performance improves, just as it does when I use bilinear or nearest/no filtering.

The same happens with any of these (i.e. performance stays good):

return h[...]; //...
return f.x; //...
return H_ix[...]; //...

The interpolation function:

float interpolate(float x, float v0, float v1, float v2,float v3)
{
    double c1,c2,c3,c4; //changed to float, see EDITs

    c1 = spline_matrix[0][1]*v1;
    c2 = spline_matrix[1][0]*v0 + spline_matrix[1][2]*v2;
    c3 = spline_matrix[2][0]*v0 + spline_matrix[2][1]*v1 + spline_matrix[2][2]*v2 + spline_matrix[2][3]*v3;
    c4 = spline_matrix[3][0]*v0 + spline_matrix[3][1]*v1 + spline_matrix[3][2]*v2 + spline_matrix[3][3]*v3;

    return(c4*x*x*x + c3*x*x +c2*x + c1);
};
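The question does not show spline_matrix. Judging from which entries interpolate() reads (only [0][1] in the first row, only [1][0] and [1][2] in the second, all of the last two), one plausible assumption is the Catmull-Rom basis. Taking the first index as the power of x, as the code uses it, the polynomial would be:

$$
p(x) = c_1 + c_2 x + c_3 x^2 + c_4 x^3,
\qquad
\begin{pmatrix} c_1 \\ c_2 \\ c_3 \\ c_4 \end{pmatrix}
= \frac{1}{2}
\begin{pmatrix}
 0 &  2 &  0 &  0 \\
-1 &  0 &  1 &  0 \\
 2 & -5 &  4 & -1 \\
-1 &  3 & -3 &  1
\end{pmatrix}
\begin{pmatrix} v_0 \\ v_1 \\ v_2 \\ v_3 \end{pmatrix}
$$

so $c_1 = v_1$ and $c_2 = (v_2 - v_0)/2$, which gives $p(0) = v_1$ and $p(1) = c_1 + c_2 + c_3 + c_4 = v_2$: the curve passes through the two middle samples, with $v_0$ and $v_3$ only shaping the tangents.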

The fps drops only when I return the final H_iy value. How can the return value affect performance?

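EDIT: I just realized I had declared c1, c2, ... as double in the interpolate() function. I changed them to float, and now performance is still good and the returned values are correct. So the question has changed a bit:

How do double-precision variables affect performance on the hardware, and why did the other interpolate() calls not trigger this performance loss, only the last one, given that the H_ix[] array is float too, just like H_iy?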

score 3 · Accepted Answer

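You can take advantage of the hardware's bilinear interpolation. Bicubic interpolation can essentially be written as a bilinear interpolation of bilinearly interpolated input points, like this: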

uniform sampler2D texture;
uniform sampler2D mask;
uniform vec2 texOffset;
varying vec4 vertColor;
varying vec4 vertTexCoord;
void main() {
  vec4 p0 = texture2D(texture, vertTexCoord.st).rgba;
  vec2 d  = texOffset * 0.125;
  vec4 p1 = texture2D(texture, vertTexCoord.st+vec2( d.x, d.y)).rgba;
  vec4 p2 = texture2D(texture, vertTexCoord.st+vec2(-d.x, d.y)).rgba;
  vec4 p3 = texture2D(texture, vertTexCoord.st+vec2( d.x,-d.y)).rgba;
  vec4 p4 = texture2D(texture, vertTexCoord.st+vec2(-d.x,-d.y)).rgba;
  gl_FragColor = (  2.0*p0   + p1 + p2 + p3 + p4)/6.0;
 }

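This is the result:

[Image 1: standard hardware interpolation]
[Image 2: bicubic interpolation using the code above]
[Image 3: the same bicubic interpolation, with the colors discretized to show contour lines]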

score 2 · Accepted Answer

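One thing you can do to speed this up is to use texelFetch() instead of floor()/texture(), so the hardware doesn't waste time on any filtering. That said, hardware filtering is very fast, which is partly why I linked the GPU Gems article. There is also a textureSize() function now, which saves you from passing the size in yourself.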

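GLSL has a very aggressive optimizer that throws away everything it can. Say you spend ages computing a really expensive lighting value, but at the end you just write colour = vec4(1): all your calculations are discarded and the shader runs very fast. This can take some getting used to when you try to benchmark things, and I believe it is what you're seeing when you return different values. Think of every variable as having a dependency tree: if a variable (including uniforms and attributes, even across shader stages) doesn't contribute to the output, GLSL ignores it completely. One place where I've seen GLSL compilers fall short is copying in/out function parameters when it isn't necessary.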
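The dependency-tree argument can be made concrete with a toy model (an illustration only, not how any particular compiler works internally): treat each texture fetch and each interpolate() call in the question as a node, mark everything reachable from the returned value, and count what survives. The node numbering and function names below are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Nodes 0-15: the fetches h[0..15]; nodes 16-19: the interpolate() calls
   producing H_ix[0..3]; node 20: the final interpolate() producing H_iy. */
enum { N_FETCH = 16, N_ROW = 4, NODE_HIY = 20, N_NODES = 21 };

static int deps[N_NODES][4];   /* inputs of each node */
static int ndeps[N_NODES];     /* number of inputs (0 for fetches) */

static void build_graph(void)
{
    memset(ndeps, 0, sizeof ndeps);
    for (int r = 0; r < N_ROW; ++r) {
        ndeps[N_FETCH + r] = 4;
        for (int k = 0; k < 4; ++k)
            deps[N_FETCH + r][k] = 4 * r + k;   /* H_ix[r] reads h[4r..4r+3] */
    }
    ndeps[NODE_HIY] = 4;
    for (int r = 0; r < N_ROW; ++r)
        deps[NODE_HIY][r] = N_FETCH + r;        /* H_iy reads H_ix[0..3] */
}

/* Marks a node and, recursively, everything it depends on. */
static void mark(int node, int live[N_NODES])
{
    if (live[node]) return;
    live[node] = 1;
    for (int k = 0; k < ndeps[node]; ++k)
        mark(deps[node][k], live);
}

/* Counts the fetches and interpolate() calls still needed when the shader
   returns `output` (a node index, or -1 for a value such as 0 or f.x that
   needs no fetch at all). */
static void count_live(int output, int *fetches, int *interps)
{
    int live[N_NODES] = {0};
    build_graph();
    if (output >= 0) mark(output, live);
    *fetches = *interps = 0;
    for (int n = 0; n < N_NODES; ++n) {
        if (!live[n]) continue;
        if (n < N_FETCH) ++*fetches; else ++*interps;
    }
}
```

Under this model, returning 0 or f.x keeps no fetches, h[k] keeps one, H_ix[k] keeps four fetches and a single interpolate() call, and only H_iy keeps all 16 fetches and all five (originally double-precision) calls. That would be consistent with the slowdown appearing only when H_iy is returned.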

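As for double precision, there is a similar question here: https://superuser.com/questions/386456/why-does-a-geforce-card-perform-4x-slower-in-double-precision-than-a-tesla-card. In general, graphics needs to be fast and almost always uses only single precision. For more general compute applications, such as scientific simulations, doubles of course give you more precision. You will probably find more information on this in connection with CUDA.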
