
This is not strictly a programming question, but knowing the hardware is a critical part of programming.

So I am starting this thread in the hope that people here can share their experience programming on Kepler (GK10X or GK110).

First, I'll share mine:

I am doing some programming on GK110 at the moment. For some applications, GK110 is significantly faster than Fermi, close to its theoretical peak (e.g. 2.5-3X faster). But for others it isn't (e.g. only ~50-60% faster).

Correct me if I am wrong, but it seems to me that the main performance bottleneck of Kepler is very high resource pressure:

On a per-SM level, Fermi actually has far more resources per execution unit than GK110: each Fermi SM has only one SIMT unit, whilst each Kepler SM has six.

Yet per SM, Fermi has a 32K-entry register file and a maximum of 1536 active threads, whilst each Kepler SM has only 33% more active threads and 100% more registers, with 800% of the instruction-issue units and the same amount of L1 cache.

The memory and computation latencies are about the same in absolute terms (about half as many GPU cycles).

So resource pressure is much higher on GK110 compared to GF110.

With 800% of the instruction-issue units, it seems that Nvidia wants to use more aggressive TLP and ILP to hide latency on Kepler, but this is certainly less flexible, since the L1 cache is the same size and the active-thread count increases by only 33% rather than by 500% like the SIMT units.

So it is much harder to get maximum efficiency out of Kepler. First, code has to contain significantly more ILP yet use significantly less shared memory to take advantage of Kepler's massive instruction-issue capacity. Secondly, on a per-warp level, the workload has to be computationally intensive enough that the Kepler scheduler doesn't need to switch warps often to hide latency (and it certainly doesn't have many available warps to choose from to begin with).


1 Answer


You may want to read the Kepler (GK110) whitepaper, or compare it with the Fermi whitepaper, and then study the Kepler Tuning Guide. The Tuning Guide will answer many of your questions about Kepler's differences and how to get the most out of it.

answered 2013-03-17T21:57:37.583