我正在 Pascal 上运行内存合并实验并得到意想不到的nvprof
结果。我有一个内核,可以将 4 GB 的浮点数从一个数组复制到另一个数组。nvprof
报告gld_transactions_per_request
和的混淆数字gst_transactions_per_request
。
我在 TITAN Xp 和 GeForce GTX 1080 TI 上进行了实验。结果相同。
#include <stdio.h>
#include <cstdint>
#include <assert.h>
#define N 1ULL*1024*1024*1024
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
__global__ void copy_kernel(
const float* __restrict__ data, float* __restrict__ data2) {
for (unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
tid < N; tid += blockDim.x * gridDim.x) {
data2[tid] = data[tid];
}
}
int main() {
float* d_data;
gpuErrchk(cudaMalloc(&d_data, sizeof(float) * N));
assert(d_data != nullptr);
uintptr_t d = reinterpret_cast<uintptr_t>(d_data);
assert(d%128 == 0); // check alignment, just to be sure
float* d_data2;
gpuErrchk(cudaMalloc(&d_data2, sizeof(float)*N));
assert(d_data2 != nullptr);
copy_kernel<<<1024,1024>>>(d_data, d_data2);
gpuErrchk(cudaDeviceSynchronize());
}
使用 CUDA 10.1 版编译:
nvcc coalescing.cu -std=c++11 -Xptxas -dlcm=ca -gencode arch=compute_61,code=sm_61 -O3
简介:
nvprof -m all ./a.out
分析结果中有一些令人困惑的部分:
gld_transactions = 536870914
,这意味着每个全局负载事务平均应该是4GB/536870914 = 8 bytes
。这与以下一致gld_transactions_per_request = 16.000000
:每个 warp 读取 128 个字节(1 个请求),如果每个事务是 8 个字节,那么我们需要128 / 8 = 16
每个请求的事务。为什么这个值这么低?我期望完美的合并,所以类似于 4 个(甚至 1 个)事务/请求。gst_transactions = 134217728
并且gst_transactions_per_request = 4.000000
,所以存储内存更有效?- 请求和实现的全局加载/存储吞吐量 (
gld_requested_throughput
,gst_requested_throughput
,gld_throughput
,gst_throughput
) 是150.32GB/s
每个。我希望负载的吞吐量低于商店的吞吐量,因为我们每个请求有更多的事务。 gld_transactions = 536870914
但是l2_read_transactions = 134218800
。始终通过 L1/L2 高速缓存访问全局内存。为什么 L2 读取事务的数量如此之少?它不能全部缓存在 L1 中。(global_hit_rate = 0%
)
我想我读nvprof
错了结果。任何建议,将不胜感激。
这是完整的分析结果:
Device "GeForce GTX 1080 Ti (0)"
Kernel: copy_kernel(float const *, float*)
1 inst_per_warp Instructions per warp 1.4346e+04 1.4346e+04 1.4346e+04
1 branch_efficiency Branch Efficiency 100.00% 100.00% 100.00%
1 warp_execution_efficiency Warp Execution Efficiency 100.00% 100.00% 100.00%
1 warp_nonpred_execution_efficiency Warp Non-Predicated Execution Efficiency 99.99% 99.99% 99.99%
1 inst_replay_overhead Instruction Replay Overhead 0.000178 0.000178 0.000178
1 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 0.000000 0.000000 0.000000
1 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 0.000000 0.000000 0.000000
1 local_load_transactions_per_request Local Memory Load Transactions Per Request 0.000000 0.000000 0.000000
1 local_store_transactions_per_request Local Memory Store Transactions Per Request 0.000000 0.000000 0.000000
1 gld_transactions_per_request Global Load Transactions Per Request 16.000000 16.000000 16.000000
1 gst_transactions_per_request Global Store Transactions Per Request 4.000000 4.000000 4.000000
1 shared_store_transactions Shared Store Transactions 0 0 0
1 shared_load_transactions Shared Load Transactions 0 0 0
1 local_load_transactions Local Load Transactions 0 0 0
1 local_store_transactions Local Store Transactions 0 0 0
1 gld_transactions Global Load Transactions 536870914 536870914 536870914
1 gst_transactions Global Store Transactions 134217728 134217728 134217728
1 sysmem_read_transactions System Memory Read Transactions 0 0 0
1 sysmem_write_transactions System Memory Write Transactions 5 5 5
1 l2_read_transactions L2 Read Transactions 134218800 134218800 134218800
1 l2_write_transactions L2 Write Transactions 134217741 134217741 134217741
1 global_hit_rate Global Hit Rate in unified l1/tex 0.00% 0.00% 0.00%
1 local_hit_rate Local Hit Rate 0.00% 0.00% 0.00%
1 gld_requested_throughput Requested Global Load Throughput 150.32GB/s 150.32GB/s 150.32GB/s
1 gst_requested_throughput Requested Global Store Throughput 150.32GB/s 150.32GB/s 150.32GB/s
1 gld_throughput Global Load Throughput 150.32GB/s 150.32GB/s 150.32GB/s
1 gst_throughput Global Store Throughput 150.32GB/s 150.32GB/s 150.32GB/s
1 local_memory_overhead Local Memory Overhead 0.00% 0.00% 0.00%
1 tex_cache_hit_rate Unified Cache Hit Rate 50.00% 50.00% 50.00%
1 l2_tex_read_hit_rate L2 Hit Rate (Texture Reads) 0.00% 0.00% 0.00%
1 l2_tex_write_hit_rate L2 Hit Rate (Texture Writes) 0.00% 0.00% 0.00%
1 tex_cache_throughput Unified Cache Throughput 150.32GB/s 150.32GB/s 150.32GB/s
1 l2_tex_read_throughput L2 Throughput (Texture Reads) 150.32GB/s 150.32GB/s 150.32GB/s
1 l2_tex_write_throughput L2 Throughput (Texture Writes) 150.32GB/s 150.32GB/s 150.32GB/s
1 l2_read_throughput L2 Throughput (Reads) 150.32GB/s 150.32GB/s 150.32GB/s
1 l2_write_throughput L2 Throughput (Writes) 150.32GB/s 150.32GB/s 150.32GB/s
1 sysmem_read_throughput System Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 sysmem_write_throughput System Memory Write Throughput 5.8711KB/s 5.8711KB/s 5.8701KB/s
1 local_load_throughput Local Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 local_store_throughput Local Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 shared_load_throughput Shared Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 shared_store_throughput Shared Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 gld_efficiency Global Memory Load Efficiency 100.00% 100.00% 100.00%
1 gst_efficiency Global Memory Store Efficiency 100.00% 100.00% 100.00%
1 tex_cache_transactions Unified Cache Transactions 134217728 134217728 134217728
1 flop_count_dp Floating Point Operations(Double Precision) 0 0 0
1 flop_count_dp_add Floating Point Operations(Double Precision Add) 0 0 0
1 flop_count_dp_fma Floating Point Operations(Double Precision FMA) 0 0 0
1 flop_count_dp_mul Floating Point Operations(Double Precision Mul) 0 0 0
1 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
1 flop_count_sp_add Floating Point Operations(Single Precision Add) 0 0 0
1 flop_count_sp_fma Floating Point Operations(Single Precision FMA) 0 0 0
1 flop_count_sp_mul Floating Point Operation(Single Precision Mul) 0 0 0
1 flop_count_sp_special Floating Point Operations(Single Precision Special) 0 0 0
1 inst_executed Instructions Executed 470089728 470089728 470089728
1 inst_issued Instructions Issued 470173430 470173430 470173430
1 sysmem_utilization System Memory Utilization Low (1) Low (1) Low (1)
1 stall_inst_fetch Issue Stall Reasons (Instructions Fetch) 0.79% 0.79% 0.79%
1 stall_exec_dependency Issue Stall Reasons (Execution Dependency) 1.46% 1.46% 1.46%
1 stall_memory_dependency Issue Stall Reasons (Data Request) 96.16% 96.16% 96.16%
1 stall_texture Issue Stall Reasons (Texture) 0.00% 0.00% 0.00%
1 stall_sync Issue Stall Reasons (Synchronization) 0.00% 0.00% 0.00%
1 stall_other Issue Stall Reasons (Other) 1.13% 1.13% 1.13%
1 stall_constant_memory_dependency Issue Stall Reasons (Immediate constant) 0.00% 0.00% 0.00%
1 stall_pipe_busy Issue Stall Reasons (Pipe Busy) 0.07% 0.07% 0.07%
1 shared_efficiency Shared Memory Efficiency 0.00% 0.00% 0.00%
1 inst_fp_32 FP Instructions(Single) 0 0 0
1 inst_fp_64 FP Instructions(Double) 0 0 0
1 inst_integer Integer Instructions 1.0742e+10 1.0742e+10 1.0742e+10
1 inst_bit_convert Bit-Convert Instructions 0 0 0
1 inst_control Control-Flow Instructions 1073741824 1073741824 1073741824
1 inst_compute_ld_st Load/Store Instructions 2147483648 2147483648 2147483648
1 inst_misc Misc Instructions 1077936128 1077936128 1077936128
1 inst_inter_thread_communication Inter-Thread Instructions 0 0 0
1 issue_slots Issue Slots 470173430 470173430 470173430
1 cf_issued Issued Control-Flow Instructions 33619968 33619968 33619968
1 cf_executed Executed Control-Flow Instructions 33619968 33619968 33619968
1 ldst_issued Issued Load/Store Instructions 268500992 268500992 268500992
1 ldst_executed Executed Load/Store Instructions 67174400 67174400 67174400
1 atomic_transactions Atomic Transactions 0 0 0
1 atomic_transactions_per_request Atomic Transactions Per Request 0.000000 0.000000 0.000000
1 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
1 l2_atomic_transactions L2 Transactions (Atomic requests) 0 0 0
1 l2_tex_read_transactions L2 Transactions (Texture Reads) 134217728 134217728 134217728
1 stall_memory_throttle Issue Stall Reasons (Memory Throttle) 0.00% 0.00% 0.00%
1 stall_not_selected Issue Stall Reasons (Not Selected) 0.39% 0.39% 0.39%
1 l2_tex_write_transactions L2 Transactions (Texture Writes) 134217728 134217728 134217728
1 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
1 flop_count_hp_add Floating Point Operations(Half Precision Add) 0 0 0
1 flop_count_hp_mul Floating Point Operation(Half Precision Mul) 0 0 0
1 flop_count_hp_fma Floating Point Operations(Half Precision FMA) 0 0 0
1 inst_fp_16 HP Instructions(Half) 0 0 0
1 sysmem_read_utilization System Memory Read Utilization Idle (0) Idle (0) Idle (0)
1 sysmem_write_utilization System Memory Write Utilization Low (1) Low (1) Low (1)
1 pcie_total_data_transmitted PCIe Total Data Transmitted 1024 1024 1024
1 pcie_total_data_received PCIe Total Data Received 0 0 0
1 inst_executed_global_loads Warp level instructions for global loads 33554432 33554432 33554432
1 inst_executed_local_loads Warp level instructions for local loads 0 0 0
1 inst_executed_shared_loads Warp level instructions for shared loads 0 0 0
1 inst_executed_surface_loads Warp level instructions for surface loads 0 0 0
1 inst_executed_global_stores Warp level instructions for global stores 33554432 33554432 33554432
1 inst_executed_local_stores Warp level instructions for local stores 0 0 0
1 inst_executed_shared_stores Warp level instructions for shared stores 0 0 0
1 inst_executed_surface_stores Warp level instructions for surface stores 0 0 0
1 inst_executed_global_atomics Warp level instructions for global atom and atom cas 0 0 0
1 inst_executed_global_reductions Warp level instructions for global reductions 0 0 0
1 inst_executed_surface_atomics Warp level instructions for surface atom and atom cas 0 0 0
1 inst_executed_surface_reductions Warp level instructions for surface reductions 0 0 0
1 inst_executed_shared_atomics Warp level shared instructions for atom and atom CAS 0 0 0
1 inst_executed_tex_ops Warp level instructions for texture 0 0 0
1 l2_global_load_bytes Bytes read from L2 for misses in Unified Cache for global loads 4294967296 4294967296 4294967296
1 l2_local_load_bytes Bytes read from L2 for misses in Unified Cache for local loads 0 0 0
1 l2_surface_load_bytes Bytes read from L2 for misses in Unified Cache for surface loads 0 0 0
1 l2_local_global_store_bytes Bytes written to L2 from Unified Cache for local and global stores. 4294967296 4294967296 4294967296
1 l2_global_reduction_bytes Bytes written to L2 from Unified cache for global reductions 0 0 0
1 l2_global_atomic_store_bytes Bytes written to L2 from Unified cache for global atomics 0 0 0
1 l2_surface_store_bytes Bytes written to L2 from Unified Cache for surface stores. 0 0 0
1 l2_surface_reduction_bytes Bytes written to L2 from Unified Cache for surface reductions 0 0 0
1 l2_surface_atomic_store_bytes Bytes transferred between Unified Cache and L2 for surface atomics 0 0 0
1 global_load_requests Total number of global load requests from Multiprocessor 134217728 134217728 134217728
1 local_load_requests Total number of local load requests from Multiprocessor 0 0 0
1 surface_load_requests Total number of surface load requests from Multiprocessor 0 0 0
1 global_store_requests Total number of global store requests from Multiprocessor 134217728 134217728 134217728
1 local_store_requests Total number of local store requests from Multiprocessor 0 0 0
1 surface_store_requests Total number of surface store requests from Multiprocessor 0 0 0
1 global_atomic_requests Total number of global atomic requests from Multiprocessor 0 0 0
1 global_reduction_requests Total number of global reduction requests from Multiprocessor 0 0 0
1 surface_atomic_requests Total number of surface atomic requests from Multiprocessor 0 0 0
1 surface_reduction_requests Total number of surface reduction requests from Multiprocessor 0 0 0
1 sysmem_read_bytes System Memory Read Bytes 0 0 0
1 sysmem_write_bytes System Memory Write Bytes 160 160 160
1 l2_tex_hit_rate L2 Cache Hit Rate 0.00% 0.00% 0.00%
1 texture_load_requests Total number of texture Load requests from Multiprocessor 0 0 0
1 unique_warps_launched Number of warps launched 32768 32768 32768
1 sm_efficiency Multiprocessor Activity 99.63% 99.63% 99.63%
1 achieved_occupancy Achieved Occupancy 0.986477 0.986477 0.986477
1 ipc Executed IPC 0.344513 0.344513 0.344513
1 issued_ipc Issued IPC 0.344574 0.344574 0.344574
1 issue_slot_utilization Issue Slot Utilization 8.61% 8.61% 8.61%
1 eligible_warps_per_cycle Eligible Warps Per Active Cycle 0.592326 0.592326 0.592326
1 tex_utilization Unified Cache Utilization Low (1) Low (1) Low (1)
1 l2_utilization L2 Cache Utilization Low (2) Low (2) Low (2)
1 shared_utilization Shared Memory Utilization Idle (0) Idle (0) Idle (0)
1 ldst_fu_utilization Load/Store Function Unit Utilization Low (1) Low (1) Low (1)
1 cf_fu_utilization Control-Flow Function Unit Utilization Low (1) Low (1) Low (1)
1 special_fu_utilization Special Function Unit Utilization Idle (0) Idle (0) Idle (0)
1 tex_fu_utilization Texture Function Unit Utilization Low (1) Low (1) Low (1)
1 single_precision_fu_utilization Single-Precision Function Unit Utilization Low (1) Low (1) Low (1)
1 double_precision_fu_utilization Double-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
1 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
1 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.00% 0.00% 0.00%
1 flop_dp_efficiency FLOP Efficiency(Peak Double) 0.00% 0.00% 0.00%
1 dram_read_transactions Device Memory Read Transactions 134218560 134218560 134218560
1 dram_write_transactions Device Memory Write Transactions 134176900 134176900 134176900
1 dram_read_throughput Device Memory Read Throughput 150.32GB/s 150.32GB/s 150.32GB/s
1 dram_write_throughput Device Memory Write Throughput 150.27GB/s 150.27GB/s 150.27GB/s
1 dram_utilization Device Memory Utilization High (7) High (7) High (7)
1 half_precision_fu_utilization Half-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
1 ecc_transactions ECC Transactions 0 0 0
1 ecc_throughput ECC Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 dram_read_bytes Total bytes read from DRAM to L2 cache 4294993920 4294993920 4294993920
1 dram_write_bytes Total bytes written from L2 cache to DRAM 4293660800 4293660800 4293660800