pixel-shader - Why does a newer version of the Cg compiler generate a shader with more instructions?

Question

I have a shader that looks like this:

void main( in   float2              pos         : TEXCOORD0,
           in   uniform sampler2D   data        : TEXUNIT0,
           in   uniform sampler2D   palette     : TEXUNIT1,
           in   uniform float       c,
           in   uniform float       th0,
           in   uniform float       th1,
           in   uniform float       th2,
           in   uniform float4      BackGroundColor,
           out  float4              color       : COLOR
         )
{
    const float4 dataValue = tex2D( data, pos );
    const float vValue = dataValue.x;
    const float tValue = dataValue.y;

    color = BackGroundColor;
    if ( tValue <= th2 )
    {
        if ( tValue < th1 )
        {
            const float vRealValue = abs( vValue - 0.5 );
            if ( vRealValue > th0 )
            {
                // determine value and color
                const float power = ( c > 0.0 ) ? vValue : ( 1.0 - vValue );
                color = tex2D( palette, float2( power, 0.0 ) );
            }
        }
        else
        {
            color = float4( 0.0, tValue, 0.0, 1.0 );
        }
    }
}

I'm compiling it with:

cgc -profile arbfp1 -strict -O3 -q sh.cg -o sh.asm

Different versions of the Cg compiler produce different output for it.

cgc version 2.2.0006 compiles the shader into assembly code with 18 instructions:

!!ARBfp1.0
PARAM c[6] = { program.local[0..4],{ 0, 1, 0.5 } };
TEMP R0;
TEMP R1;
TEMP R2;
TEX R0.xy, fragment.texcoord[0], texture[0], 2D;
ADD R0.z, -R0.x, c[5].y;
CMP R0.z, -c[0].x, R0.x, R0;
MOV R0.w, c[5].x;
TEX R1, R0.zwzw, texture[1], 2D;
SLT R0.z, R0.y, c[2].x;
ADD R0.x, R0, -c[5].z;
ABS R0.w, R0.x;
SGE R0.x, c[3], R0.y;
MUL R2.x, R0, R0.z;
SLT R0.w, c[1].x, R0;
ABS R2.y, R0.z;
MUL R0.z, R2.x, R0.w;
CMP R0.w, -R2.y, c[5].x, c[5].y;
CMP R1, -R0.z, R1, c[4];
MUL R2.x, R0, R0.w;
MOV R0.xzw, c[5].xyxy;
CMP result.color, -R2.x, R0, R1;
END
# 18 instructions, 3 R-regs

cgc version 3.0.0016 compiles the shader into assembly code with 23 instructions:

!!ARBfp1.0
PARAM c[6] = { program.local[0..4], { 0, 1, 0.5 } };
TEMP R0;
TEMP R1;
TEMP R2;
TEX R0.xy, fragment.texcoord[0], texture[0], 2D;
ADD R1.y, R0.x, -c[5].z;
MOV R1.z, c[0].x;
ABS R1.y, R1;
SLT R1.z, c[5].x, R1;
SLT R1.x, R0.y, c[2];
SGE R0.z, c[3].x, R0.y;
MUL R0.w, R0.z, R1.x;
SLT R1.y, c[1].x, R1;
MUL R0.w, R0, R1.y;
ABS R1.z, R1;
CMP R1.y, -R1.z, c[5].x, c[5];
MUL R1.y, R0.w, R1;
ADD R1.z, -R0.x, c[5].y;
CMP R1.z, -R1.y, R1, R0.x;
ABS R0.x, R1;
CMP R0.x, -R0, c[5], c[5].y;
MOV R1.w, c[5].x;
TEX R1, R1.zwzw, texture[1], 2D;
CMP R1, -R0.w, R1, c[4];
MUL R2.x, R0.z, R0;
MOV R0.xzw, c[5].xyxy;
CMP result.color, -R2.x, R0, R1;
END
# 23 instructions, 3 R-regs

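The strange thing is that the optimization level doesn't seem to have any effect with Cg 3.0.

Can someone explain what is going on? Why is the optimization not working, and why is the shader longer when compiled with Cg 3.0?

Note that I removed the comments from the compiled shaders.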

score 1 · Accepted Answer

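This might not be a real answer to the question, but it may give some more insight. I inspected the generated assembly code and converted it back to high-level code. I tried to compress it as much as possible and removed all the copies and temporaries that follow implicitly from the high-level operations. I used b variables as temporary booleans and f variables as temporary floats. The first one (compiled with version 2.2) is: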

power = ( c > 0.0 ) ? vValue : ( 1.0 - vValue );
R1 = tex2D( palette, float2( power, 0.0 ) );

vRealValue = abs( vValue - 0.5 );

b1 = ( tValue < th1 );
b2 = ( tValue <= th2 );

b3 = b1;

b1 = b1 && b2 && ( vRealValue > th0 );
R1 = b1 ? R1 : BackGroundColor;

color = ( b2 && !b3 ) ? float4( 0.0, tValue, 0.0, 1.0 ) : R1;

The second one (compiled with version 3.0) is:

vRealValue = abs( vValue - 0.5 );

f0 = c;
b0 = ( 0 < f0 );

b1 = ( tValue < th1 );
b2 = ( tValue <= th2 );

b4 = b1 && b2 && ( vRealValue > th0 );

b0 = b0;
b3 = b1;

power = ( b4 && !b0 ) ? ( 1.0 - vValue ) : vValue;
R1 = tex2D( palette, float2( power, 0.0 ) );

R1 = b4 ? R1 : BackGroundColor;

color = ( b2 && !b3 ) ? float4( 0.0, tValue, 0.0, 1.0 ) : R1;
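
As a sanity check on the two reconstructions above, here is a minimal Python sketch (not Cg) that compares them against the original shader logic over a small grid of inputs. The palette() function is a made-up stand-in for the texture lookup, and the function names and sample values are arbitrary assumptions for illustration; nothing here comes from the compiler itself.

```python
from itertools import product

def palette(u):
    # Hypothetical stand-in for tex2D(palette, float2(u, 0.0)).
    return (u, 1.0 - u, 0.25, 1.0)

def original(v, t, c, th0, th1, th2, bg):
    # Direct transcription of the nested ifs in the Cg source.
    color = bg
    if t <= th2:
        if t < th1:
            if abs(v - 0.5) > th0:
                power = v if c > 0.0 else 1.0 - v
                color = palette(power)
        else:
            color = (0.0, t, 0.0, 1.0)
    return color

def decompiled_22(v, t, c, th0, th1, th2, bg):
    # Reconstruction of the cgc 2.2 output.
    power = v if c > 0.0 else 1.0 - v
    r1 = palette(power)
    b1 = t < th1
    b2 = t <= th2
    b3 = b1
    b1 = b1 and b2 and abs(v - 0.5) > th0
    r1 = r1 if b1 else bg
    return (0.0, t, 0.0, 1.0) if (b2 and not b3) else r1

def decompiled_30(v, t, c, th0, th1, th2, bg):
    # Reconstruction of the cgc 3.0 output.
    b0 = 0.0 < c
    b1 = t < th1
    b2 = t <= th2
    b4 = b1 and b2 and abs(v - 0.5) > th0
    b3 = b1
    power = (1.0 - v) if (b4 and not b0) else v
    r1 = palette(power)
    r1 = r1 if b4 else bg
    return (0.0, t, 0.0, 1.0) if (b2 and not b3) else r1

def count_mismatches():
    # Arbitrary sample grid; thresholds deliberately include th1 > th2.
    vals = (0.0, 0.25, 0.5, 0.75, 1.0)
    bg = (0.1, 0.2, 0.3, 1.0)
    bad = 0
    for v, t, c, th0, th1, th2 in product(vals, vals, (-1.0, 0.0, 1.0),
                                          (0.1, 0.3), vals, vals):
        args = (v, t, c, th0, th1, th2, bg)
        ref = original(*args)
        if decompiled_22(*args) != ref or decompiled_30(*args) != ref:
            bad += 1
    return bad
```

If the reconstructions are faithful, count_mismatches() finds no differing input, which supports the reading that both compiler versions compute the same result and differ only in how many instructions they spend on it.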

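Most parts are essentially the same, but the second program performs some unnecessary operations. It copies the variable c into a temporary instead of using it directly. It also swaps vValue and 1 - vValue in the power computation, so it has to negate b0 (costing one more CMP), whereas the first program needs no temporary at all (it uses a single CMP instead of an SLT followed by a CMP). It additionally uses b4 in this computation, which is completely unnecessary, because when b4 is false the result of the texture access is irrelevant anyway. This costs one more && (implemented with MUL). There is also the unnecessary copy from b1 to b3 (necessary in the first program, but not in the second), and the extremely useless copy of b0 into itself (disguised as an ABS, but since the value comes from an SLT it can only be 0.0 or 1.0, so the ABS degenerates to a MOV).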
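
The instruction-level points above can be illustrated with a small Python model of the scalar behaviour of the instructions involved. This is only a sketch: the helpers follow the ARB_fragment_program definitions of SLT, CMP, ABS and MUL as I understand them, and power_22 / power_30 are hypothetical names for a hand transcription of the power selection in each listing. Counting by hand, the 3.0 sequence is five instructions longer (MOV, SLT, ABS, CMP, MUL), which matches the 23 vs. 18 totals.

```python
# Scalar models of the ARB fragment program instructions used in both listings.
def SLT(a, b):
    return 1.0 if a < b else 0.0   # set on less than: result is 0.0 or 1.0

def CMP(a, b, c):
    return b if a < 0.0 else c     # select b when a is negative, else c

def ABS(a):
    return abs(a)

def MUL(a, b):
    return a * b                   # acts as logical AND on 0.0/1.0 flags

def power_22(c, v):
    # cgc 2.2 (hand transcription): ADD, CMP -> 2 instructions
    one_minus_v = -v + 1.0               # ADD
    return CMP(-c, v, one_minus_v)       # CMP: c > 0 ? v : 1 - v

def power_30(c, v, b4):
    # cgc 3.0 (hand transcription): MOV, SLT, ABS, CMP, MUL, ADD, CMP -> 7
    f0 = c                               # MOV: copy of c into a temporary
    b0 = SLT(0.0, f0)                    # SLT: b0 = (0 < c)
    b0 = ABS(b0)                         # ABS: no-op on a 0.0/1.0 flag
    not_b0 = CMP(-b0, 0.0, 1.0)          # CMP: !b0
    sel = MUL(b4, not_b0)                # MUL: b4 && !b0
    one_minus_v = -v + 1.0               # ADD
    return CMP(-sel, one_minus_v, v)     # CMP: sel ? 1 - v : v
```

Whenever b4 is 1.0, which is the only case in which the palette lookup is actually used, power_30(c, v, 1.0) returns the same value as power_22(c, v), so the extra instructions do not change the visible result.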

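So the second program is quite similar to the first one, just with some additional and, IMHO, completely useless instructions. The optimizer seems to have done a worse job than in the previous (!) version. Since the Cg compiler is an NVIDIA product (and not from some other graphics company that shall remain unnamed), this behaviour is really strange.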
