I'm writing an OpenGL3 2D engine and am currently trying to track down a bottleneck. Here is the output of the AMD Profiler: http://h7.abload.de/img/profilerausa.png


However, at 50,000 sprites the test app is already unusable, running at 5 fps.

This shows that my bottleneck is the transform function I use. This is the function in question: http://code.google.com/p/nightlight2d/source/browse/NightLightDLL/NLBoundingBox.cpp#130

void NLBoundingBox::applyTransform(NLVertexData* vertices)
{
    if ( needsTransform() )
    {
        // Apply Matrix
        for ( int i=0; i<6; i++ )
        {
            glm::vec4 transformed = m_rotation * m_translation * glm::vec4(vertices[i].x, vertices[i].y, 0, 1.0f);
            vertices[i].x = transformed.x;
            vertices[i].y = transformed.y;
        }
        m_translation = glm::mat4(1);
        m_rotation    = glm::mat4(1);
        m_needsTransform = false;
    }
}

I can't do this in the shader, because I'm batching all sprites at once. That means I have to use the CPU to compute the transformations.


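I'm not using any threads at the moment, so when I use vsync I also take an extra performance hit, because it waits for the screen to finish. That tells me I should use threading.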

Another approach might be to use OpenCL? I'd like to avoid CUDA because, as far as I know, it only runs on NVIDIA cards. Is that correct?




Please note that running the test app requires VC++2008 to be installed, because it is a debug build made for running the profiler.


4 Answers


The first thing I would do is concatenate your rotation and translation matrices into one matrix before you enter the for-loop ... that way you aren't computing a matrix-matrix multiplication plus a matrix-vector multiplication on every iteration; instead you would only be multiplying a single vector by a single matrix. Secondly, you may want to look into unrolling your loop and then compiling with a higher optimization level (on g++ I would use at least -O2, but I'm not familiar with MSVC, so you'll have to translate that optimization level yourself). That would avoid any overhead that branches in the code might incur, especially on cache-flushes. Lastly, if you haven't already looked into it, check into doing some SSE optimizations since you're dealing with vectors.
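A minimal sketch of the first suggestion. To keep it compilable on its own it uses a hypothetical hand-rolled `Mat3` and `Vertex` rather than GLM or the question's `NLVertexData`; with GLM the same idea would be hoisting `m_rotation * m_translation` into a local matrix above the loop. The multiplication order follows the question's code (translation applied first, then rotation).

```cpp
#include <array>
#include <cmath>
#include <cstddef>

// Hypothetical stand-ins for the question's types (not the GLM or NightLight2D API).
struct Vertex { float x, y; };

// Row-major 3x3 matrix, enough for 2D affine transforms.
struct Mat3 { std::array<float, 9> m; };

Mat3 multiply(const Mat3& a, const Mat3& b) {
    Mat3 r{};
    for (std::size_t row = 0; row < 3; ++row) {
        for (std::size_t col = 0; col < 3; ++col) {
            float sum = 0.f;
            for (std::size_t k = 0; k < 3; ++k) {
                sum += a.m[row * 3 + k] * b.m[k * 3 + col];
            }
            r.m[row * 3 + col] = sum;
        }
    }
    return r;
}

Mat3 makeRotation(float radians) {
    const float c = std::cos(radians);
    const float s = std::sin(radians);
    return Mat3{{ c, -s, 0.f,   s, c, 0.f,   0.f, 0.f, 1.f }};
}

Mat3 makeTranslation(float tx, float ty) {
    return Mat3{{ 1.f, 0.f, tx,   0.f, 1.f, ty,   0.f, 0.f, 1.f }};
}

// Transforms the point (v.x, v.y, 1) and drops the homogeneous coordinate.
Vertex transformPoint(const Mat3& t, Vertex v) {
    return { t.m[0] * v.x + t.m[1] * v.y + t.m[2],
             t.m[3] * v.x + t.m[4] * v.y + t.m[5] };
}

void applyTransform(Vertex* vertices, std::size_t count,
                    const Mat3& rotation, const Mat3& translation) {
    // The matrix-matrix product is computed once, not once per vertex.
    const Mat3 combined = multiply(rotation, translation);
    for (std::size_t i = 0; i < count; ++i) {
        vertices[i] = transformPoint(combined, vertices[i]);
    }
}
```

Inside the loop only one matrix-vector product per vertex remains, which is the saving the paragraph above describes.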

UPDATE: I'm going to add one last idea that would involve threading ... basically pipeline your vertices when you do your threading. So for instance, let's say you have a machine with eight available CPU threads (i.e., quad-core with hyper-threading). Set up six threads for the vertex pipeline processing, and use non-locking single-consumer/producer queues to pass messages between stages of the pipeline. Each stage will transform a single member of your six-member vertex-array. I'm guessing there are a bunch of these six-member vertex arrays, so if they are set up as a stream that is passed through the pipeline, you can process the stream very efficiently and avoid the use of mutexes, semaphores, and other locks. For more info on a fast non-locking single-producer/consumer queue, see my answer here.

UPDATE 2: You only have a dual-core processor ... so dump the pipeline idea since it's going to run into bottlenecks as each thread contends for CPU resources.

answered 2011-08-02T19:12:15.107

"I can't do this in the shader, because I'm batching all sprites at once. That means I have to use the CPU to compute the transformations."


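What you need is not fewer batches. You need the right number of batches. You know you've gone too far with batching when you give up GPU vertex transforms in favor of CPU transforms.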

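As Datenwolf suggests, you need to use some instancing to get the transforms back onto the GPU. But even then, you need to undo some of the over-batching you have here. You haven't said much about what kind of scene you're rendering (tile maps with sprites on top, a large particle system, etc.), so it's hard to know what to suggest.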
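As a rough illustration of what instancing changes on the CPU side: instead of transforming six vertices per sprite, you upload one small record per sprite and let the vertex shader apply it. The `SpriteInstance` layout and `packInstances` helper below are hypothetical, and the OpenGL calls appear only in comments because they need a live context (instanced arrays require OpenGL 3.3 or ARB_instanced_arrays).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-sprite record: one of these per instance replaces
// six CPU-transformed vertices.
struct SpriteInstance {
    float x, y;      // translation
    float rotation;  // radians
    float scale;
};

// Flattens the records into the float array that would be uploaded
// to a per-instance vertex buffer.
std::vector<float> packInstances(const std::vector<SpriteInstance>& sprites) {
    std::vector<float> data;
    data.reserve(sprites.size() * 4);
    for (const SpriteInstance& s : sprites) {
        data.push_back(s.x);
        data.push_back(s.y);
        data.push_back(s.rotation);
        data.push_back(s.scale);
    }
    return data;
}

// With the instance buffer bound, upload and draw would look roughly like
// this (not compiled here; attribute index 1 is an arbitrary choice):
//   glBufferData(GL_ARRAY_BUFFER, data.size() * sizeof(float), data.data(), GL_DYNAMIC_DRAW);
//   glVertexAttribPointer(1, 4, GL_FLOAT, GL_FALSE, 0, nullptr);
//   glEnableVertexAttribArray(1);
//   glVertexAttribDivisor(1, 1);  // advance this attribute once per instance
//   glDrawArraysInstanced(GL_TRIANGLES, 0, 6, spriteCount);
```

The vertex shader would then rotate, scale, and translate a shared unit quad using the per-instance attribute, so the CPU only writes four floats per sprite.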

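Also, GLM is a fine math library, but it is not designed for maximum performance. It is not what I would generally use if I needed to transform 300,000 vertices on the CPU every frame.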

answered 2011-08-02T22:17:06.493

If you insist on doing the computations on the CPU, you should do the math yourself.

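Right now you are using 4x4 matrices in a 2D environment, where a 2x2 matrix for rotation and a simple vector for translation would suffice. That is 4 multiplications and 2 additions for the rotation, and 2 additions for the translation.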


That is far fewer operations than what those 4x4 matrices are doing now.
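A minimal sketch of the hand-written 2D math, assuming a hypothetical `Vertex2D` struct in place of the question's `NLVertexData`. It applies the translation first and the rotation second, to match the `m_rotation * m_translation * v` order in the question; swap the two steps if your engine rotates first.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical 2D vertex; the real NLVertexData may carry more fields.
struct Vertex2D { float x, y; };

// Translates each vertex by (tx, ty), then rotates it about the origin.
// Per vertex: 2 additions for the translation, then 4 multiplications
// and 2 additions/subtractions for the 2x2 rotation.
void translateThenRotate(Vertex2D* vertices, std::size_t count,
                         float tx, float ty, float radians) {
    // sin and cos are computed once per sprite, not once per vertex.
    const float c = std::cos(radians);
    const float s = std::sin(radians);
    for (std::size_t i = 0; i < count; ++i) {
        const float x = vertices[i].x + tx;
        const float y = vertices[i].y + ty;
        vertices[i].x = c * x - s * y;
        vertices[i].y = s * x + c * y;
    }
}
```

For comparison, a general 4x4 matrix times a 4-component vector costs 16 multiplications and 12 additions per vertex.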

answered 2011-08-03T00:18:03.330

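The assignment inside the loop could be a problem, though I'm not familiar with the library. Moving it out of the for loop and assigning the fields manually may help. Moving the matrix multiplication outside the loop will help as well.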



// Apply Matrix
glm::vec4 transformed;
glm::mat4 translation = m_rotation * m_translation; // combined once, outside the loop
for ( int i=0; i<6; i++ )
{
    transformed.x = vertices[i].x;
    transformed.y = vertices[i].y;
    transformed.z = vertices[i].z; // assumes NLVertexData has a z member; use 0.f otherwise
    transformed.w = 1.f; // ?
    // The answer assumed an in-place multiply; GLM's mat4 * vec4 operator is used here instead
    transformed = translation * transformed;
    vertices[i].x = transformed.x;
    vertices[i].y = transformed.y;
}



glm::vec4 transformed[6];
for (size_t i = 0; i < 6; i++) {
    transformed[i].x = vertices[i].x;
    transformed[i].y = vertices[i].y;
    transformed[i].z = vertices[i].z; // assumes NLVertexData has a z member; use 0.f otherwise
    transformed[i].w = 1.f; // ?
}
glm::mat4 translation = m_rotation * m_translation;
for (size_t i = 0; i < 6; i++) {
    // The answer assumed an in-place multiply; GLM's mat4 * vec4 operator is used here instead
    transformed[i] = translation * transformed[i];
}
for (size_t i = 0; i < 6; i++) {
    vertices[i].x = transformed[i].x;
    vertices[i].y = transformed[i].y;
}

As Jason mentioned, manually unrolling these loops might be interesting.



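When you have a high-level problem in low-level code like this, you end up blindly calling this method over and over, thinking it is free. Whatever you assume about how often needsTransform is true could be wildly incorrect.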

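The reality is that you should call this method only once. Call applyTransform when you want to apply the transform, not when you might want to apply it. Interfaces should be contracts; treat them as such.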
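One way to read this advice, sketched with hypothetical `Sprite` and `SpriteBatch` types (none of these names come from NightLight2D): the code that changes a sprite records that fact, and the per-frame pass visits only the sprites that actually changed instead of polling every sprite with a "needs transform?" check.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sprite: position plus a move that has not been applied yet.
struct Sprite {
    float x, y;                  // current position
    float pendingDx, pendingDy;  // accumulated, not yet applied
    bool queued;                 // already in the dirty list?
};

class SpriteBatch {
public:
    std::size_t add(float x, float y) {
        Sprite s = { x, y, 0.f, 0.f, false };
        sprites_.push_back(s);
        return sprites_.size() - 1;
    }

    // Records a move and queues the sprite at most once per frame.
    void move(std::size_t index, float dx, float dy) {
        Sprite& s = sprites_[index];
        s.pendingDx += dx;
        s.pendingDy += dy;
        if (!s.queued) {
            s.queued = true;
            dirty_.push_back(index);
        }
    }

    // Applies pending moves to changed sprites only.
    // Returns how many sprites were actually transformed.
    std::size_t flush() {
        for (std::size_t index : dirty_) {
            Sprite& s = sprites_[index];
            s.x += s.pendingDx;
            s.y += s.pendingDy;
            s.pendingDx = 0.f;
            s.pendingDy = 0.f;
            s.queued = false;
        }
        const std::size_t transformed = dirty_.size();
        dirty_.clear();
        return transformed;
    }

    const Sprite& at(std::size_t index) const { return sprites_[index]; }

private:
    std::vector<Sprite> sprites_;
    std::vector<std::size_t> dirty_;
};
```

With 50,000 mostly static sprites, `flush()` then costs time proportional to the number of sprites that moved, not to the total sprite count.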

answered 2011-08-02T19:28:29.800