directx - Compute shader with numthreads (1,1,1) runs extremely slowly

Question

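I have just started learning DirectX programming, using F# with SharpDX as the .NET wrapper. As a test case, I render the Mandelbrot set. The calculation is done with 2 compute shaders.

The first shader calculates the depth for each pixel (function "CalcMandel"), and the results are stored in a RWStructuredBuffer. This calculation requires a lot of single- or double-precision multiplications, yet it is blazingly fast on my GPU (AMD 7790). "CalcMandel" has the attribute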

[numthreads(16, 16, 1)]

and is executed via

context.Dispatch (imageWidth / 16, imageHeight / 16, 1)

No problems here: a 1000 x 800 pixel image of the "core" Mandelbrot set runs at more than 1000 fps (using single precision on the GPU).

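The second shader does almost nothing: it calculates the minimum, maximum, and average of the previous calculation (function "CalcMinMax"). "CalcMinMax" has the attribute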

[numthreads(1, 1, 1)]

and is called via

context.Dispatch (1,1,1)

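For the given image size, a single GPU thread has to traverse a buffer of 800,000 integers to calculate the minimum, maximum, and average. I use a single thread because I have no idea how to implement this calculation in a parallel fashion.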

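The problem: "CalcMinMax" is terribly slow. The frame rate drops from over 1000 fps to 5 fps!

My questions: What is wrong here? Am I using the wrong setup or parameters (numthreads)? How can I speed up the min-max calculation?

My thoughts: My first assumption was that accessing the RWBuffer might be slow. This is not the case: when I replaced the buffer access with a constant, the frame rate did not increase.

My GPU has approximately 900 shader cores and uses thousands of threads to calculate the Mandelbrot set, while "CalcMinMax" uses only one thread. Still, I do not understand why things get so slow.

I would appreciate any advice!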

=================================================

// HLSL CONTENT (Mandelbrot set calculations omitted):

cbuffer cbCSMandel : register( b0 )
{

double  a0, b0, da, db;
double  ja0, jb0;   
int max_iterations;
bool julia;     int  cycle;
int width;      int height;
double colorFactor;
int algoIndex;
int step;
};


struct statistics
{
  int   minDepth;
  int   maxDepth;
  float avgDepth;
  int   loops;
};

RWStructuredBuffer<float4> colorOutputTable :   register (u0);
StructuredBuffer<float4> output2 :          register (t0);
RWStructuredBuffer<int> counterTable :          register (u1);
RWStructuredBuffer<float4> colorTable :     register (u2);

RWStructuredBuffer<statistics>statsTable :      register (u3);


// Mandelbrot calculations….
// Results are written to counterTable and colorOutputTable


// I limit the samples to 10000 pixels because calcMinMax is too slow
#define NUM_SAMPLES 10000

void calcMinMax() 
{
    int minDepth = 64000;
    int maxDepth = 0;
    int len = width * height;
    int crit = len / NUM_SAMPLES;
    int steps = max (crit, 1);
    int index = 0;          
    int sumCount = 0;
    float sum = 0.0;

    while (index < len)
    {
        int cnt = counterTable[index];
        minDepth = (cnt < minDepth && cnt > 0) ? cnt : minDepth;
        maxDepth = cnt > maxDepth ? cnt : maxDepth;
        sum += cnt > 0 ? cnt : 0.0f;
        sumCount += cnt > 0 ? 1 : 0;
        index += steps;
    }

    statsTable[0].minDepth = minDepth;
    statsTable[0].maxDepth = maxDepth;
    statsTable[0].avgDepth = sum / sumCount;
    statsTable[0].loops += 1;
}


[numthreads(1, 1, 1)]
void CalcMinMax ( uint3 Gid : SV_GroupID, uint3 DTid : SV_DispatchThreadID, uint3 GTid :    SV_GroupThreadID, uint GI : SV_GroupIndex )

{
    switch (GI)     // this switch is used to verify GI number (always 0)
    {
        case 0: calcMinMax();
            break;

        default: ;
            break;
    }
}

// * ** * ** * ** * ** * ** * ** * F# program (min-max part) * ** * ** * ** * ** *

Shader setup:

use minMaxShaderCode = ShaderBytecode.CompileFromFile(shaderPath, "CalcMinMax", "cs_5_0")                                                                
minMaxShader <- new ComputeShader(device, minMaxShaderCode.Bytecode.Data  )

Shader usage:

// ---------- CONNECT MinMap Shader
context.ComputeShader.Set(minMaxShader)    
context.ComputeShader.SetUnorderedAccessView(STATS_SLOT, statsBuffer.BufferView) 

context.ComputeShader.SetConstantBuffer(CONSTANT_SLOT, constantBuffer)
context.ComputeShader.SetUnorderedAccessView (COUNTER_SLOT, dataBuffer.BufferView)  
context.Dispatch (1,1,1)

// ---------- DISCONNECT MinMap Shader            
context.ComputeShader.SetConstantBuffer(CONSTANT_SLOT, null)
context.ComputeShader.SetUnorderedAccessView (STATS_SLOT, null) 
context.ComputeShader.SetUnorderedAccessView (COUNTER_SLOT, null) 
context.ComputeShader.Set (null)

Reading the statistics:

context.CopyResource(statsBuffer.DataBuffer, statsBuffer.StagingBuffer)
let boxer, stream = context.MapSubresource(statsBuffer.StagingBuffer, MapMode.Read, MapFlags.None)                                                                                                                                    
calcStatistics <- stream.Read<CalcStatistics>()
context.UnmapSubresource(statsBuffer.StagingBuffer, 0)

score 8 · Accepted Answer

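If you dispatch only 1 thread, every shader unit on the GPU except one will sit idle, waiting for that thread to finish. You need to parallelize your min-max algorithm, and given that you have to process a set of values to arrive at a single value, this is a classic reduction problem. A more efficient approach is to compute local minima/maxima recursively. A detailed explanation, including an example that sums the values of an array, can be found here (starting at slide 19).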
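To make the idea concrete, here is a rough sketch of what one reduction pass could look like in HLSL. It is not taken from the linked slides and is untested against the asker's setup; `GROUP_SIZE`, `Partial`, `cbReduce`, `partials`, the register slots `b1`/`u4`, and the kernel name `ReduceMinMax` are all made up for illustration. Only `counterTable` at `u1` comes from the question. Each thread group reduces 256 counters in groupshared memory and writes one partial result.

```hlsl
// Sketch only. Everything marked "hypothetical" is an assumption, not the asker's code.
#define GROUP_SIZE 256                      // hypothetical threads per group (power of two)

struct Partial                              // hypothetical per-group result
{
    int minDepth;
    int maxDepth;
    int sum;
    int count;
};

cbuffer cbReduce : register(b1)             // hypothetical constant buffer
{
    uint len;                               // number of counters (width * height)
};

RWStructuredBuffer<int>     counterTable : register(u1);  // as in the question
RWStructuredBuffer<Partial> partials     : register(u4);  // hypothetical, one entry per group

groupshared Partial cache[GROUP_SIZE];

[numthreads(GROUP_SIZE, 1, 1)]
void ReduceMinMax(uint3 Gid : SV_GroupID, uint3 DTid : SV_DispatchThreadID, uint GI : SV_GroupIndex)
{
    // Each thread loads one counter. Out-of-range threads and non-positive
    // counters contribute neutral values, mirroring the "cnt > 0" tests.
    int cnt = (DTid.x < len) ? counterTable[DTid.x] : 0;

    Partial p;
    p.minDepth = cnt > 0 ? cnt : 0x7fffffff;
    p.maxDepth = cnt > 0 ? cnt : 0;
    p.sum      = cnt > 0 ? cnt : 0;
    p.count    = cnt > 0 ? 1 : 0;
    cache[GI] = p;
    GroupMemoryBarrierWithGroupSync();

    // Tree reduction: halve the number of active threads at every step.
    for (uint s = GROUP_SIZE / 2; s > 0; s >>= 1)
    {
        if (GI < s)
        {
            cache[GI].minDepth = min(cache[GI].minDepth, cache[GI + s].minDepth);
            cache[GI].maxDepth = max(cache[GI].maxDepth, cache[GI + s].maxDepth);
            cache[GI].sum     += cache[GI + s].sum;
            cache[GI].count   += cache[GI + s].count;
        }
        GroupMemoryBarrierWithGroupSync();
    }

    // Thread 0 of each group writes the group's partial result.
    if (GI == 0)
        partials[Gid.x] = cache[0];
}
```

Under these assumptions you would dispatch `(len + 255) / 256` groups in x, which is 3125 groups for 800,000 counters. The remaining partial results are few enough to combine in a second small pass or on the CPU after readback; the average is then the total `sum` divided by the total `count` (guard against a zero count).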

score 1 · Accepted Answer

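Many thanks for your feedback. My question has been answered.

In his reply, akhanubis shared a link to a PDF document that describes the map-reduce problem on GPUs. Before posting my question on Stack Overflow, I had searched the internet extensively, and I had already found this paper and read it twice!

Why did I still miss the point? In the worst case, a time of 8 ms for reducing an array of 4M elements seemed acceptable to me (I have only 800,000 points). But I did not realize that even the worst case in the presentation is at least 100 times faster than my single-threaded approach, because it uses 128 thread groups.

I will use the concepts from this paper to implement a multithreaded version of my min-max calculation.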
