Questions tagged "nvvp" | Stack Overflow中文网

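0 votes

2 answers

712 views

nvvp - Too much profiler overhead with NVidia Visual Profiler

When I profile my code with nvvp (or with nvprof), I get a lot of profiling overhead:

Out of a total time of 98 ms, 85 ms is "Instrumentation" on the first kernel launch.

How can I reduce this profiling overhead, or zoom in on only the part I am interested in?

Background

I run it with "Start execution with profiling enabled" unchecked, and I restrict profiling with cudaProfilerStart/cudaProfilerStop, as follows:

where applyConvolution() is the function I am profiling.

I am using CUDA Toolkit 8.0 on Ubuntu 16.04 with a GTX 1080.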
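The asker's snippet is missing from this listing. As a rough illustration of the pattern described (not the asker's actual code), here is a minimal sketch, assuming one untimed warm-up call is acceptable before the region of interest. `applyConvolution()` is a hypothetical placeholder, `profileConvolutionOnly`, `check` and `trace` are names invented for this sketch, and the `#else` branch contains stand-ins (not the real API) so the file also builds with a plain C++ compiler. Whether the warm-up actually moves the "Instrumentation" cost out of the profiled region is not confirmed by the question.

```cpp
#include <cassert>
#include <string>
#include <vector>

#ifdef __CUDACC__
#include <cuda_profiler_api.h>  // cudaProfilerStart, cudaProfilerStop
#include <cuda_runtime.h>       // cudaFree, cudaDeviceSynchronize
#else
// Stand-ins (NOT the real API) so the sketch also builds without nvcc.
inline int cudaFree(void*) { return 0; }
inline int cudaDeviceSynchronize() { return 0; }
inline int cudaProfilerStart() { return 0; }
inline int cudaProfilerStop() { return 0; }
#endif

// Order of host-side steps, recorded only so the flow can be inspected.
std::vector<std::string> trace;

// cudaSuccess is 0; this sketch treats anything else as a failure.
inline void check(int code) {
    assert(code == 0 && "CUDA call failed");
    (void)code;  // avoid an unused-parameter warning when asserts are disabled
}

// Hypothetical placeholder for the asker's function; real code launches kernels here.
void applyConvolution() { trace.push_back("applyConvolution"); }

void profileConvolutionOnly() {
    check(cudaFree(nullptr));        // common idiom to force CUDA context creation
    applyConvolution();              // warm-up call, outside the profiled region
    check(cudaDeviceSynchronize());  // make sure the warm-up work has finished

    check(cudaProfilerStart());      // data collection begins here
    trace.push_back("start");
    applyConvolution();              // the call of interest
    check(cudaDeviceSynchronize());  // wait for its kernels before stopping
    check(cudaProfilerStop());       // data collection ends here
    trace.push_back("stop");
}
```

With nvprof, the counterpart of unchecking "Start execution with profiling enabled" is the `--profile-from-start off` option; without it, collection starts at process launch and the start call has nothing to delimit.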

nvvp

2017-10-23T23:53:20.327

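0 votes

1 answer

294 views

cuda - How do I profile CUDA code on a headless node?

I am developing a CUDA application that I want to profile. So far I have only used the command-line profiler nvprof, which shows only aggregate statistics.

I have considered using the GUI profiler, NVVP. The problem is that the remote Linux node where I run the application has no GUI at all (not even X.org). Besides, even if I managed to get some X11 stack onto the remote node, keeping my own laptop active for the whole profiling run would be very tedious.

I tried to collect all the information needed with:

I then copy the output file to my laptop and view it in NVVP. However, this has three problems.

First, when I load the output file into NVVP, I get no file transfer information. It does not appear in the NVVP window at all.

Second, the call graph is completely distorted. The gaps between kernel launches are at least 100 times larger than the kernel durations, which makes any dependency or stream analysis impossible.

Finally, my application uses a lot of GPU memory. During profiling the device runs out of memory, which does not happen in a standalone run.

How should I properly profile my CUDA application on a headless node?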

2017-11-07T21:34:11.333

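0 votes

1 answer

2460 views

cuda - How do I specify the nvprof "devices" option for the Nvidia Visual Profiler?

CUDA Toolkit 9.0, Windows 10, GTX 1060 and NVS 315, driver version 385.54.

The Nvidia Visual Profiler always fails to profile, returning the following two warning messages:

"Warning: This version of nvprof doesn't support the underlying device, GPU profiling skipped"

"Warning: No CUDA application was profiled, exiting"

Note that my machine has two GPUs installed. Physically removing the NVS 315 solves the problem; the Visual Profiler then works. Disabling the NVS 315 through Device Manager also works. I only want to profile the GTX 1060, but I want to do so with the NVS 315 installed and not disabled.

Using nvprof and specifying the "--devices" option works:

However

Exporting the results from nvprof and then opening them in the Visual Profiler works:

But I am lazy and do not want to repeat this a hundred times, every time I profile.

So it seems that something about the NVS 315 is incompatible. Also, specifying something like "--devices 0" for the Visual Profiler (nvvp) looks like it would be a solution. If only I knew how.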

cuda nvprof nvvp

2018-04-17T12:19:59.900

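0 votes

1 answer

338 views

cuda - What does "Instruction Issued" mean in the report given by CUDA nvvp?

I used the Nvidia Visual Profiler (nvvp) to run a kernel analysis on a cuBLAS kernel. This link, Latency Distribution, shows the resulting latency distribution.

The documentation explains the term "Instruction Issued" this way: "Instruction Issued - Warp was issued", which confuses me. What does it actually mean?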

cuda profiling nvvp

2018-04-19T13:17:08.360

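0 votes

1 answer

1364 views

linux - Profiling arbitrary CUDA applications

I am of course aware of nvvp and nvprof, but for various reasons nvprof does not want to work with my application, which involves a lot of shared libraries. nvidia-smi can hook into the driver to find out what is running, but I cannot find a good way to attach nvprof to a running process.

There is a flag, --profile-all-processes, which does actually give me the message "NVPROF is profiling process 12345", but nothing further is printed. I am using CUDA 8.

How can I get a detailed performance breakdown of my CUDA kernels in this situation?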

linux cuda nvprof nvvp

2018-05-18T03:39:46.033

0 votes

1 answer

448 views

cuda - How to print api calls per thread with nvprof

I am profiling a CUDA application and dumping the logs to a file, say target.prof.

My application uses multiple threads to dispatch kernels and I want to observe the api calls from just one of those threads. I tried using nvprof -i target.prof --print-api-trace but this does not print the thread_id.

When I open this file with the visual profiler, I can see which API calls were launched from which thread. How can I access the same information using the command line profiler?

Edit: View in the visual profiler

cuda gpu nvidia nvprof nvvp

2018-09-12T05:28:27.540

0 votes

1 answer

160 views

cuda - CUDA's nvvp reports non-ideal memory access pattern, but bandwidth is almost peaking

EDIT: new minimal working example to illustrate the question and better explanation of nvvp's outcome (following suggestions given in the comments).

So, I have crafted a "minimal" working example, which follows:

When I compile and run the above code, kernels duplicate_whole and duplicate_half take just about the same time to run.

However, when I analyze the kernels using nvvp I get different reports for each of the kernels in the following sense. For kernel duplicate_whole, nvvp warns me that at line 23 (d = { 2.0f * d.x, 2.0f * d.y };) the kernel is performing

I agree that I am loading 8-byte words. What I do not understand is why 4 bytes is the ideal word size. In particular, there is no performance difference between the kernels.

I suppose that there must be circumstances where this global store access pattern could cause performance degradation. What are these?

And why is it that I do not get a performance hit?
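One way to reason about the two questions above is to count how many memory sectors a warp touches against how many bytes it actually asked for. The following is a minimal host-only sketch, assuming 32-byte sectors, a 32-thread warp, and a warp whose first access is sector-aligned. `analyzeWarp` and `WarpAccess` are names invented here, and this simplified model is not how nvvp computes its "ideal" figure.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Assumptions of this model: 32-byte memory sectors and 32 threads per warp.
constexpr std::size_t kSectorBytes = 32;
constexpr std::size_t kWarpSize = 32;

struct WarpAccess {
    std::size_t sectors;         // distinct sectors touched by the warp
    std::size_t bytesRequested;  // bytes the threads actually asked for
    // Fraction of the moved bytes that were requested (1.0 means no waste).
    double efficiency() const {
        return static_cast<double>(bytesRequested) /
               static_cast<double>(sectors * kSectorBytes);
    }
};

// Thread t accesses wordBytes bytes starting at byte offset t * strideBytes.
WarpAccess analyzeWarp(std::size_t wordBytes, std::size_t strideBytes) {
    assert(wordBytes > 0 && strideBytes >= wordBytes);
    std::set<std::size_t> touched;
    for (std::size_t t = 0; t < kWarpSize; ++t) {
        const std::size_t first = t * strideBytes;
        const std::size_t last = first + wordBytes - 1;
        for (std::size_t s = first / kSectorBytes; s <= last / kSectorBytes; ++s) {
            touched.insert(s);
        }
    }
    return {touched.size(), kWarpSize * wordBytes};
}
```

Under these assumptions, `analyzeWarp(4, 4)` and `analyzeWarp(8, 8)` both come out fully efficient: the contiguous 8-byte case touches twice as many sectors but uses every byte of them, which is consistent with the near-peak bandwidth reported in the question. A case such as `analyzeWarp(4, 8)` (each thread using only 4 bytes out of every 8, for instance reading just one component of a two-float struct) touches the same 8 sectors while using half of the bytes, which is one circumstance in which wide, strided accesses would be expected to cost bandwidth.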

I hope that this edit has clarified some unclear points.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

I'll start with some kernel code to exemplify my question, which follows below.

Let us consider the case where data_t is float, which is memory bandwidth limited. As it can be seen above, there are two versions of the kernel, one which reads/writes 8 bytes (a whole complex number) per thread and another which reads/writes 4 bytes per thread and then shuffles the results so the complex product is computed correctly.

The reason why I have written the version using shuffle is because nvvp insisted that reading 8 bytes per thread was not the best idea because this memory access pattern would be inefficient. This is the case even though in both systems tested (GTX 1050 and GTX Titan Xp) memory bandwidth was very close to theoretical maximum.

Surely enough I knew that no improvement was likely to happen, and this was indeed the case: both kernels take pretty much the same time to run. So, my question is the following:

Why is it that nvvp reports that reading 8 bytes would be less efficient than reading 4 bytes per thread? In which circumstances would that be the case?

As a side note, single precision is more important to me, but double is useful in some cases too. Interestingly enough, in the case where data_t is double, there is no execution time difference between the two kernel versions either, even though in this case the kernel is compute bound and the shuffle version performs some more flops than the original version.

Note: the kernels are applied to a row_length * M * b dataset (b images with row_length columns and M lines) and the chirp_factor array is row_length * M. Both kernels run perfectly fine (I can edit the question to show you the calls to both versions if you have doubts about it).

cuda nvvp

2018-11-08T18:32:50.060

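0 votes

0 answers

234 views

cuda - How can a register-only instruction stall due to "memory dependency"?

I am profiling a CUDA kernel using nvprof with PC sampling enabled, to understand some latency issues I am having. The GPU I am using is a P100 (compute capability 6.0).

PC sampling reports that a DFMA frequently stalls due to memory dependency. The SASS code for the DFMA is as follows:

My view of the problem is that R8 needs to be loaded by LDG.E.CI.64, and the miss rate in L2 is very high.

The definition of a memory dependency stall is "A load/store cannot be made because the required resources are not available or are fully utilized, or too many requests of a given type are outstanding".

What confuses me is that DFMA is not a load/store operation. If I am right that the stall is caused by data not yet being available in R8, then it should be an execution dependency. What does a memory dependency stall on a DFMA mean?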

cuda nvprof nvvp

2018-12-23T20:24:32.507

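0 votes

1 answer

616 views

cuda - Why don't operations in two CUDA streams overlap?

My program is a pipeline that contains multiple kernels and memcpys. Each task goes through the same pipeline with different input data. When processing a task, the host code first chooses a Channel, which is an encapsulation of scratchpad memory and CUDA objects. After the last stage, I record an event and then go on to process the next task.
The main pipeline logic is as follows. The problem is that operations in different streams do not overlap. I have attached the timeline of processing 10 tasks. You can see that none of the operations in the streams overlap. For each kernel, there are 256 threads in a block and 5 blocks in a grid. All buffers used for memcpy are pinned, and I am sure I have met the requirements for overlapping kernel execution and data transfers. Can someone help me find the reason? Thanks.

Environment information
GPU: Tesla K40m (GK110)
Max Warps/SM: 64
Max Thread Blocks/SM: 16
Max Threads/SM: 2048
CUDA version: 8.0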
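The per-SM limits listed above allow a quick check of whether grids of this size could be resident side by side at all. The following is a minimal sketch, assuming a warp size of 32 and 15 SMs on the K40m (the SM count is not stated in the question). `SmLimits`, `residentBlocksPerSm` and `gridsThatFit` are names invented here. Only the three listed limits are considered; registers, shared memory and any hardware cap on concurrent kernels are ignored, so the result is an upper bound and not a prediction.

```cpp
#include <algorithm>
#include <cassert>

struct SmLimits {
    int maxWarps;    // "Max Warps/SM"
    int maxBlocks;   // "Max Thread Blocks/SM"
    int maxThreads;  // "Max Threads/SM"
};

// Blocks of threadsPerBlock threads that one SM can hold, judged by the
// warp, block and thread limits only.
int residentBlocksPerSm(const SmLimits& sm, int threadsPerBlock) {
    assert(threadsPerBlock > 0 && threadsPerBlock <= sm.maxThreads);
    const int warpsPerBlock = (threadsPerBlock + 31) / 32;  // assumes warp size 32
    return std::min({sm.maxBlocks,
                     sm.maxWarps / warpsPerBlock,
                     sm.maxThreads / threadsPerBlock});
}

// Upper bound on how many grids of blocksPerGrid blocks fit on the device at once.
int gridsThatFit(const SmLimits& sm, int numSms, int threadsPerBlock, int blocksPerGrid) {
    assert(numSms > 0 && blocksPerGrid > 0);
    return numSms * residentBlocksPerSm(sm, threadsPerBlock) / blocksPerGrid;
}
```

With the listed values, `residentBlocksPerSm(SmLimits{64, 16, 2048}, 256)` is 8, so a 5-block grid leaves most of the device idle by these limits. That suggests the missing overlap is unlikely to come from thread, warp or block capacity, though it says nothing about the other conditions for concurrency.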