c++ - Improving VBO performance in OpenGL

Question

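I'm currently trying to render complex models at a decent speed and am running into trouble: rendering a single model strains my frame rate, with no other work going on in the program. My model (there is only one in the scene) appears to be too large. The vertex array I upload to the buffer contains 444384 floats (so 24688 triangles in the model).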

//Create vertex buffers
glGenBuffers(1, &m_Buffer);
glBindBuffer(GL_ARRAY_BUFFER, m_Buffer);    
int SizeInBytes = m_ArraySize * 6 * sizeof(float);
glBufferData(GL_ARRAY_BUFFER, SizeInBytes, NULL, GL_DYNAMIC_DRAW);

//Upload buffer data
glBufferSubData(GL_ARRAY_BUFFER, 0, sizeof(float) * VertArray.size(), &VertArray[0]);

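I know the size of the VBO is what makes the difference, because A) reducing the size improves performance, and B) commenting out the rendering code: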

glPushMatrix();

//Translate
glTranslatef(m_Position.x, m_Position.y, m_Position.z);

glMultMatrixf(m_RotationMatrix);

//Bind buffers for vertex and index arrays
glBindBuffer(GL_ARRAY_BUFFER, m_Buffer);

glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(3, GL_FLOAT, 6 * sizeof(float), 0);
glEnableClientState(GL_NORMAL_ARRAY);
glNormalPointer(GL_FLOAT, 6 * sizeof(float), (void*)12);

//Draw
glDrawArrays(GL_TRIANGLES, 0, m_ArraySize);

glDisableClientState(GL_VERTEX_ARRAY);
glDisableClientState(GL_NORMAL_ARRAY);

//Unbind the buffers
glBindBuffer(GL_ARRAY_BUFFER, 0);

glPopMatrix();

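leaves me at about 2000-2500 FPS, whereas uncommenting this code drops me to about 130 FPS, or 8 ms/frame (which on its own is plenty, but I also need to be able to do other things in the program, some of which may be CPU intensive). A more complex model with 85k triangles brings it down to 50 FPS, or about 20 ms/frame, at which point the program visibly stutters.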

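The pair of shaders I'm using is very minimal at this point, and I doubt that's where the problem lies. Here they are, just in case; first the vertex shader: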

void main()
{
    vec3 normal, lightDir;
    vec4 diffuse;
    float NdotL;
    /* first transform the normal into eye space and normalize the result */
    normal = normalize(gl_NormalMatrix * gl_Normal);

    /* now normalize the light's direction. Note that according to the
    OpenGL specification, the light is stored in eye space. Also since
    we're talking about a directional light, the position field is actually
    direction */
    lightDir = normalize(vec3(gl_LightSource[0].position));

    /* compute the cos of the angle between the normal and lights direction.
    The light is directional so the direction is constant for every vertex.
    Since these two are normalized the cosine is the dot product. We also
    need to clamp the result to the [0,1] range. */
    NdotL = max(dot(normal, lightDir), 0.0);

    /* Compute the diffuse term */
    diffuse = gl_FrontMaterial.diffuse * gl_LightSource[0].diffuse;
    gl_FrontColor = NdotL * diffuse;

    gl_Position = ftransform();
}

And the fragment shader:

void main()
{
    gl_FragColor = gl_Color;
}

I'm running the program on a GTX 660M graphics card.

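As far as I know, VBOs are the fastest way to render large numbers of polygons in OpenGL, and the internet seems to suggest that many machines can compute and display millions of polygons at once, so I'm sure there must be a way to optimize the rendering of my relatively measly 27k triangles. I'd rather do this now than rewrite and restructure large amounts of code in the future.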

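I have backface culling enabled; I'm not sure frustum culling would help, because at times all or most of the model is on screen (I currently cull whole objects, but not triangles within an individual object). Culling the faces in the viewport that aren't facing the camera might help, but I don't know how to do that. Beyond that, I'm not sure how to optimize the rendering. I haven't implemented a vertex buffer yet, but I've read that this might only increase speed by around 10%.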

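How do people get tens or hundreds of thousands of triangles on screen at once at acceptable frame rates while doing other things as well? What can I do to improve the performance of my VBO rendering?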

UPDATE: As per the comments below, I drew only half of the array, like so:

glDrawArrays(GL_TRIANGLES, 0, m_ArraySize/2);

And then a quarter of the array:

glDrawArrays(GL_TRIANGLES, 0, m_ArraySize/4);

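Reducing the amount of the array drawn each time literally doubled the speed (from 12 ms to 6 ms and 3 ms respectively), yet the model is completely intact; nothing is missing. This seems to suggest that I'm doing something wrong elsewhere, but I don't know what. I'm fairly confident I'm not adding the same triangles 4+ times when building the model, so what else could it be? Could I somehow be uploading the buffer multiple times?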

score 3 · Accepted Answer

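glDrawArrays() takes the number of indices to draw as its third argument. You are passing in the number of floats in your interleaved vertex and normal array, which is 6 times the number of indices. The GPU is lagging because you're telling it to access data outside the bounds of your buffer. Modern GPUs trigger a fault when this happens; older ones would just crash your system :)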

Consider the following interleaved array:

vx0 vy0 vz0 nx0 ny0 nz0 vx1 vy1 vz1 nx1 ny1 nz1 vx2 vy2 vz2 nx2 ny2 nz2

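This array contains three vertices and three normals (one triangle). Drawing a triangle requires three vertices, so you need three indices to select them. To draw the triangle above you would use: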

glDrawArrays(GL_TRIANGLES, 0, 3);

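The way attributes work (vertex, normal, color, texture, etc.), a single index selects one value from each attribute. If you added a color attribute to the triangle above, you would still use only 3 indices.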
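To make the count conversion concrete, here is a minimal sketch. It assumes, as this answer infers, that m_ArraySize in the question holds the number of floats rather than the number of vertices. The helper name vertexCountFromFloats is hypothetical and does not appear in the original code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the original post): converts the number of
// floats in an interleaved array into the vertex count glDrawArrays() expects.
std::size_t vertexCountFromFloats(std::size_t floatCount,
                                  std::size_t floatsPerVertex) {
    // The array must hold a whole number of vertices.
    assert(floatsPerVertex > 0 && floatCount % floatsPerVertex == 0);
    return floatCount / floatsPerVertex;
}

// For the question's layout (3 position floats + 3 normal floats per vertex):
//   std::size_t count = vertexCountFromFloats(VertArray.size(), 6);
//   glDrawArrays(GL_TRIANGLES, 0, static_cast<GLsizei>(count));
```

With the question's 444384 floats this gives 74064 vertices, i.e. the 24688 triangles the asker expects. If the same assumption holds, the buffer allocation would likewise need only VertArray.size() * sizeof(float) bytes.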

score 3 · Accepted Answer

I think the problem is that each triangle in your model has its own three vertices. You're not using indexed triangles (GL_ELEMENT_ARRAY_BUFFER, glDrawElements), which would make it possible for vertex data to be shared between triangles.

From what I can tell, there are two issues with your current approach.

1. The sheer amount of data that needs to be processed (although this can be a problem with indexed triangles as well).
2. When using glDrawArrays() as opposed to glDrawElements(), the GPU cannot make use of the post-transform cache, which is used to reduce the amount of vertex processing.

If possible, re-arrange your data to use indexed triangles.

I'll just add the caveat that if you use indexed triangles, you have to make sure that you're sharing vertex data between triangles as much as possible to get the best performance. It's really about how well you organise your data.
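As a rough illustration of what re-arranging the data could look like, the sketch below collapses bit-identical vertices of a non-indexed, interleaved triangle list into a unique vertex array plus an index array. It assumes the question's layout of 6 floats per vertex (position, then normal). The names Vertex, IndexedMesh and buildIndexedMesh are hypothetical, and a real mesh pipeline might prefer a hash map or a tolerance-based comparison.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Illustrative types and names; none of these appear in the original post.
using Vertex = std::array<float, 6>;  // x, y, z, nx, ny, nz

struct IndexedMesh {
    std::vector<Vertex> vertices;        // unique vertices, for GL_ARRAY_BUFFER
    std::vector<std::uint32_t> indices;  // for GL_ELEMENT_ARRAY_BUFFER
};

// Collapses identical vertices of an interleaved, non-indexed triangle list.
// Assumes the data contains no NaN values (they would break the map ordering).
IndexedMesh buildIndexedMesh(const std::vector<float>& interleaved) {
    assert(interleaved.size() % 6 == 0);
    IndexedMesh mesh;
    std::map<Vertex, std::uint32_t> seen;  // vertex -> index of first occurrence
    for (std::size_t i = 0; i < interleaved.size(); i += 6) {
        Vertex v;
        for (std::size_t k = 0; k < 6; ++k) v[k] = interleaved[i + k];
        auto it = seen.find(v);
        if (it == seen.end()) {
            // First time this vertex is seen: append it and remember its index.
            auto index = static_cast<std::uint32_t>(mesh.vertices.size());
            it = seen.emplace(v, index).first;
            mesh.vertices.push_back(v);
        }
        mesh.indices.push_back(it->second);
    }
    return mesh;
}
```

The resulting indices would be uploaded to a GL_ELEMENT_ARRAY_BUFFER and drawn with glDrawElements(GL_TRIANGLES, count, GL_UNSIGNED_INT, 0). Vertices that share a position but carry different normals (hard edges) do not merge under this scheme, so the savings depend on how smooth the model is.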

score 3 · Accepted Answer

Edit: I've read some of the comments; responses are below.

Some random things to try.

glBufferData(GL_ARRAY_BUFFER, SizeInBytes, NULL, GL_DYNAMIC_DRAW);

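Try GL_STATIC_DRAW. It probably won't help in the steady state (because the driver should notice that no re-upload is needed, since the vertex buffer data is not being modified), but it's worth a try.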

glDisableClientState(GL_VERTEX_ARRAY);
glDisableClientState(GL_NORMAL_ARRAY);

//Unbind the buffers
glBindBuffer(GL_ARRAY_BUFFER, 0);

Don't change vertex buffer state after every draw if you don't need to. It's just one buffer; leave it bound.

    normal = normalize(gl_NormalMatrix * gl_Normal);

    /* now normalize the light's direction. Note that according to the
    OpenGL specification, the light is stored in eye space. Also since
    we're talking about a directional light, the position field is actually
    direction */
    lightDir = normalize(vec3(gl_LightSource[0].position));

    /* compute the cos of the angle between the normal and lights direction.
    The light is directional so the direction is constant for every vertex.
    Since these two are normalized the cosine is the dot product. We also
    need to clamp the result to the [0,1] range. */
    NdotL = max(dot(normal, lightDir), 0.0);

You can actually optimize this a little and save one normalize() (and thus a semi-expensive invsqrt). Note that for vectors v1 and v2, and scalars s1 and s2:

dot(v1 * s1, v2 * s2) == s1 * s2 * dot(v1, v2);

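So if v1 and v2 are not normalized, you can factor out their squared magnitudes, multiply those together, and then multiply the dot product by a single combined invsqrt to scale it down.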
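The identity can be checked on the CPU with a small sketch. Vec3 here is a hypothetical stand-in for a GLSL vec3, and both function names are illustrative. The first function normalizes each vector separately (two inverse square roots); the second scales the raw dot product by one inverse square root of the product of the squared lengths.

```cpp
#include <cassert>
#include <cmath>

// Minimal stand-in for a GLSL vec3; hypothetical, for illustration only.
struct Vec3 { float x, y, z; };

float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Reference version: normalize both vectors (two inverse square roots).
float cosineTwoNormalizes(const Vec3& a, const Vec3& b) {
    float invLenA = 1.0f / std::sqrt(dot(a, a));
    float invLenB = 1.0f / std::sqrt(dot(b, b));
    Vec3 na{a.x * invLenA, a.y * invLenA, a.z * invLenA};
    Vec3 nb{b.x * invLenB, b.y * invLenB, b.z * invLenB};
    return dot(na, nb);
}

// Combined version: scale the raw dot product by a single inverse square root
// of the product of the squared lengths.
float cosineOneInvSqrt(const Vec3& a, const Vec3& b) {
    float lenSqProduct = dot(a, a) * dot(b, b);
    assert(lenSqProduct > 0.0f);  // both vectors must be non-zero
    return dot(a, b) / std::sqrt(lenSqProduct);
}
```

In the shader this would correspond roughly to max(dot(n, l) * inversesqrt(dot(n, n) * dot(l, l)), 0.0) with n and l left unnormalized. Whether that is measurably faster on a given GPU is something to profile rather than assume.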

85k triangles at around 50 FPS? On a GTX 660M I'd say you're doing fine. I wouldn't expect much higher numbers on the hardware you're running.

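As for the fixed-function pipeline: all the cool kids are using the fully programmable pipeline these days. But FF won't cost you performance; internally, the driver compiles the FF state into a set of shaders, so it executes on the GPU as shaders anyway.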

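As @JamesSteele mentioned, indexed triangles are a good idea if you can maintain good locality of reference in your vertex data. That may require regenerating or otherwise reworking your input data, though.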
