caching - Profiling the L2 cache on CUDA compute capability 3.x with nvprof

Question

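I'm having trouble profiling the L2 cache on a CUDA card of compute capability 3.5. In Kepler (3.x), loads from global memory are cached only in L2, never in L1. My question is: how do I use nvprof (the command-line profiler) to find the hit rate that my global loads achieve in the L2 cache? I have queried all the metrics I can collect, and the ones involving L2 are: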

         l2_l1_read_hit_rate:  Hit rate at L2 cache for all read requests from L1 cache
    l2_texture_read_hit_rate:  Hit rate at L2 cache for all read requests from texture cache
       l2_l1_read_throughput:  Memory read throughput seen at L2 cache for read requests from L1 cache
  l2_texture_read_throughput:  Memory read throughput seen at L2 cache for read requests from the texture cache
        l2_read_transactions:  Memory read transactions seen at L2 cache for all read requests
       l2_write_transactions:  Memory write transactions seen at L2 cache for all write requests
          l2_read_throughput:  Memory read throughput seen at L2 cache for all read requests
         l2_write_throughput:  Memory write throughput seen at L2 cache for all write requests
              l2_utilization:  The utilization level of the L2 cache relative to the peak utilization

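The only hit rate I get is for reads coming from L1. But reads of global memory never come from L1, because they are not cached there! Or am I wrong here, and this is exactly the metric I want?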

Surprisingly (or not), there is a metric that gives the L1 hit rate for global memory reads.

    l1_cache_global_hit_rate:  Hit rate in L1 cache for global loads

Can this ever be non-zero on Kepler?

Cheers!

score 3 · Accepted Answer

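On CC 3.5 devices, global loads have two paths. LDG instructions go through the texture unit (l2_texture_read_hit_rate). All other global load operations, including uncached loads, go through L1 to L2 (l2_l1_read_hit_rate). The counter name is l2__read_hit_rate. The name does not imply that the load was cached in L1.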

If the developer enables L1 caching, the counter l1_cache_global_hit_rate can be non-zero on GK110B and GK210. See the L1 cache section of the Kepler Tuning Guide for details.
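
To make the two load paths concrete, here is a minimal sketch. The kernel names copy_plain and copy_ldg and the file name l2_paths.cu are made up for illustration, and it assumes a CC 3.5 device. The plain load would normally be routed through L1 to L2, while __ldg() requests an LDG instruction that goes through the texture unit, so nvprof should attribute each kernel's reads to a different L2 hit-rate metric.

```cuda
// Sketch only. Assumed build and profile commands (file name is hypothetical):
//   nvcc -arch=sm_35 -o l2_paths l2_paths.cu
//   nvprof --metrics l2_l1_read_hit_rate,l2_texture_read_hit_rate ./l2_paths
// Adding -Xptxas -dlcm=ca to the nvcc line opts global loads into L1 caching
// on GK110B/GK210; l1_cache_global_hit_rate can then be added to --metrics.
#include <cstdio>
#include <cuda_runtime.h>

// Plain global load. Without __restrict__ the compiler cannot assume the
// input is read-only, so this normally takes the L1 -> L2 path.
__global__ void copy_plain(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Read-only load via __ldg() (available on sm_35 and later), which is
// served through the texture unit and then L2.
__global__ void copy_ldg(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = __ldg(&in[i]);
}

int main()
{
    const int n = 1 << 20;
    float* in = nullptr;
    float* out = nullptr;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // nvprof reports metrics per kernel, so the two paths show up separately.
    copy_plain<<<blocks, threads>>>(in, out, n);
    copy_ldg<<<blocks, threads>>>(in, out, n);

    cudaError_t err = cudaGetLastError();            // launch errors
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // execution errors
    printf("status: %s\n", cudaGetErrorString(err));

    cudaFree(in);
    cudaFree(out);
    return err == cudaSuccess ? 0 : 1;
}
```

With this setup one would expect copy_plain to show activity under l2_l1_read_hit_rate and copy_ldg under l2_texture_read_hit_rate, though the compiler is free to choose load instructions differently, so it is worth confirming with cuobjdump -sass.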

score 0 · Accepted Answer

With a default nvcc compilation it will be 0. However, if you compile with -Xptxas -dlcm=ca, it can be non-zero.
