I created three synthetic CUDA kernels that do almost nothing but arithmetic. All three kernels are identical except for the number of operations each performs: kernel #1 performs 8 operations, kernel #2 performs 16, and kernel #3 performs 32. Here are the implementations of all three kernels.
Kernel #1:
#ifndef kernelWGSXMAPIXLLXOPS8_H_
#define kernelWGSXMAPIXLLXOPS8_H_

__global__ void WGSXMAPIXLLXOPS8 (const float *GIn, float *GOut, const float M, const float N, const float P) {
  int gid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
  float MF = (float) M;
  float NF = (float) N;  // NF and PF are computed but never used
  float PF = (float) P;
  for (int lcdd = 0; lcdd < 1; lcdd++) {  // loop body executes exactly once
    float temp1 = 1.0f;
    temp1 = temp1 * MF + temp1;
    temp1 = temp1 * MF + temp1;
    temp1 = temp1 * MF + temp1;
    // ... the same dependent FMA repeated for 8 operations in total
    temp1 = temp1 * MF + temp1;
    GOut[gid] = temp1;
  }
}

void WGSXMAPIXLLXOPS8_wrapper (const float *GIn, float *GOut,
                               const float M, const float N, const float P,
                               int numBlocks, int threadPerBlock) {
  WGSXMAPIXLLXOPS8<<<numBlocks, threadPerBlock>>> (GIn, GOut, M, N, P);
}

#endif
Kernel #2:
#ifndef kernelWGSXMAPIXLLXOPS16_H_
#define kernelWGSXMAPIXLLXOPS16_H_

__global__ void WGSXMAPIXLLXOPS16 (const float *GIn, float *GOut, const float M, const float N, const float P) {
  int gid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
  float MF = (float) M;
  float NF = (float) N;  // NF and PF are computed but never used
  float PF = (float) P;
  for (int lcdd = 0; lcdd < 1; lcdd++) {  // loop body executes exactly once
    float temp1 = 1.0f;
    temp1 = temp1 * MF + temp1;
    temp1 = temp1 * MF + temp1;
    // ... the same dependent FMA repeated for 16 operations in total
    temp1 = temp1 * MF + temp1;
    temp1 = temp1 * MF + temp1;
    GOut[gid] = temp1;
  }
}

void WGSXMAPIXLLXOPS16_wrapper (const float *GIn, float *GOut,
                                const float M, const float N, const float P,
                                int numBlocks, int threadPerBlock) {
  WGSXMAPIXLLXOPS16<<<numBlocks, threadPerBlock>>> (GIn, GOut, M, N, P);
}

#endif
Kernel #3:
#ifndef kernelWGSXMAPIXLLXOPS32_H_
#define kernelWGSXMAPIXLLXOPS32_H_

__global__ void WGSXMAPIXLLXOPS32 (const float *GIn, float *GOut, const float M, const float N, const float P) {
  int gid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
  float MF = (float) M;
  float NF = (float) N;  // NF and PF are computed but never used
  float PF = (float) P;
  for (int lcdd = 0; lcdd < 1; lcdd++) {  // loop body executes exactly once
    float temp1 = 1.0f;
    temp1 = temp1 * MF + temp1;
    temp1 = temp1 * MF + temp1;
    temp1 = temp1 * MF + temp1;
    temp1 = temp1 * MF + temp1;
    temp1 = temp1 * MF + temp1;
    // ... the same dependent FMA repeated for 32 operations in total
    temp1 = temp1 * MF + temp1;
    temp1 = temp1 * MF + temp1;
    GOut[gid] = temp1;
  }
}

void WGSXMAPIXLLXOPS32_wrapper (const float *GIn, float *GOut,
                                const float M, const float N, const float P,
                                int numBlocks, int threadPerBlock) {
  WGSXMAPIXLLXOPS32<<<numBlocks, threadPerBlock>>> (GIn, GOut, M, N, P);
}

#endif
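For context, here is a minimal host harness showing how a kernel like this can be launched and timed; this is my own sketch rather than code from the original setup. The header filename, the M/N/P values, the warm-up launch, and the accounting of 2 flops per FMA are all assumptions:

#include <cstdio>
#include <cuda_runtime.h>
#include "kernelWGSXMAPIXLLXOPS8.h"   // assumed header name for kernel #1

int main () {
  const int totalThreads   = 16384;                          // as in the experiment
  const int threadPerBlock = 256;
  const int numBlocks      = totalThreads / threadPerBlock;  // 64 blocks

  float *GIn, *GOut;
  cudaMalloc ((void **) &GIn,  totalThreads * sizeof (float));
  cudaMalloc ((void **) &GOut, totalThreads * sizeof (float));

  cudaEvent_t start, stop;
  cudaEventCreate (&start);
  cudaEventCreate (&stop);

  // warm-up launch so one-time initialization cost is not timed
  WGSXMAPIXLLXOPS8_wrapper (GIn, GOut, 1.2f, 1.3f, 1.4f, numBlocks, threadPerBlock);

  cudaEventRecord (start);
  WGSXMAPIXLLXOPS8_wrapper (GIn, GOut, 1.2f, 1.3f, 1.4f, numBlocks, threadPerBlock);
  cudaEventRecord (stop);
  cudaEventSynchronize (stop);

  float ms = 0.0f;
  cudaEventElapsedTime (&ms, start, stop);

  // 8 FMAs per thread, each counted as 2 flops (multiply + add)
  double flops = (double) totalThreads * 8 * 2;
  printf ("time = %f ms, GFlops = %f\n", ms, flops / (ms * 1e6));

  cudaFree (GIn);
  cudaFree (GOut);
  return 0;
}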
The total number of threads was set to 16384, with a block size of 256. I computed the total GFlops for each kernel (the flop counting I used is sketched after the metrics below) and got 20.44, 56.53, and 110.12 GFlops respectively. I tried to come up with an explanation but couldn't think of one, so I ran nvprof and monitored all the metrics. Almost all of the metrics are nearly equal across the three kernels; below are some that look important to me (values listed for kernels 1 through 3):
                         kernel#1   kernel#2   kernel#3
sm_efficiency_instance:    14.99      16.78      19.82   %
ipc_instance:               0.57       0.93       1.53
inst_replay_overhead:       0.399      0.268      0.165
dram_write_throughput:     18.08      17.72      16.9    GB/s
issued_ipc:                 0.99       1.18       1.52
issue_slot_utilization:    19.48      24.64      33.76   %
stall_exec_dependency:     21.84      26.38      42.95   %
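As a sanity check on those GFlops numbers (my own arithmetic, counting each FMA as 2 flops): the per-launch work is 16384 × 8 × 2 = 262,144 flops for kernel #1, 524,288 for kernel #2, and 1,048,576 for kernel #3. Dividing each by its measured GFlops implies kernel times of roughly 12.8 µs, 9.3 µs, and 9.5 µs, so doubling the arithmetic barely changes the runtime.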
It makes sense that they all show roughly the same dram_write_throughput, since they all write the same amount of data to DRAM with the same total number of threads (each of the 16384 threads writes one float, so 64 KB per launch). What I don't understand is sm_efficiency: my kernels all do the same kind of arithmetic, so how can their sm_efficiency differ? And why does putting more arithmetic into the same kernel increase that efficiency? My understanding is that all of them should face the same situation when the scheduler looks for eligible warps to run on the SM.
Can anyone help me understand the difference in GFlops using the metrics above?