
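I am trying to overlap kernel execution with `cudaMemcpyAsync`, but it does not work. I follow all the recommendations in the programming guide: pinned memory, different streams, and so on. I can see that kernel executions do overlap with each other, but they never overlap with memory transfers. I know my card has only one copy engine and one execution engine, but execution and transfers should still overlap, right?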
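For anyone checking their own card, here is a minimal sketch of how the engine counts can be queried through the runtime API. The helper name `copyEngineCount` is made up for this example; the fields come from `cudaDeviceProp`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: number of copy engines on `device`, or -1 if the query fails.
int copyEngineCount(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
    return prop.asyncEngineCount;
}

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::printf("no CUDA device found\n");
        return 0;
    }
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, d) != cudaSuccess) continue;
        std::printf("device %d: %s (compute capability %d.%d)\n",
                    d, prop.name, prop.major, prop.minor);
        // 0 = no copy/kernel overlap, 1 = one copy engine, 2 = two copy engines
        std::printf("  asyncEngineCount  = %d\n", copyEngineCount(d));
        // non-zero if kernels from different streams may run concurrently
        std::printf("  concurrentKernels = %d\n", prop.concurrentKernels);
    }
    return 0;
}
```

An `asyncEngineCount` of at least 1 means the device itself supports overlapping a copy with a kernel. Whether the overlap actually shows up in the timeline still depends on the issue order and on the driver.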

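It seems that the "copy engine" and the "execution engine" always enforce the order in which I call the functions. The work consists of 4 streams, each performing [HtoD x2, Kernel, DtoH]. If I issue the whole sequence HtoD x2, Kernel, DtoH on one stream after another (depth-first), I see in the profiler that the first HtoD operation of stream 2 does not start until the first DtoH operation has finished. If I instead issue the first HtoD on every stream, then the second HtoD, then the kernel, then the DtoH (breadth-first), I still see no overlap, and the issue order is again enforced by the GPU.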
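A minimal sketch of the two issue orders described above. This is not the actual application code: `addKernel`, `runPipeline`, the buffer size and the 4-stream layout are assumptions made for illustration. Both orders give the same results and differ only in the order in which work reaches the copy and execution queues, so the interesting output is the profiler timeline rather than the printed text.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                            \
    do {                                                                       \
        cudaError_t err_ = (call);                                             \
        if (err_ != cudaSuccess) {                                             \
            std::fprintf(stderr, "%s: %s\n", #call, cudaGetErrorString(err_)); \
            std::exit(EXIT_FAILURE);                                           \
        }                                                                      \
    } while (0)

const int kStreams = 4;  // four streams, as in the question
const int kN = 1 << 20;  // elements per buffer (arbitrary choice)

// Placeholder kernel: c = a + b.
__global__ void addKernel(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Runs [HtoD x2, kernel, DtoH] on every stream; returns true if the results match.
bool runPipeline(bool breadthFirst) {
    const size_t bytes = kN * sizeof(float);
    const int threads = 256, blocks = (kN + threads - 1) / threads;
    cudaStream_t st[kStreams];
    float *hA[kStreams], *hB[kStreams], *hC[kStreams];
    float *dA[kStreams], *dB[kStreams], *dC[kStreams];

    for (int s = 0; s < kStreams; ++s) {
        CHECK(cudaStreamCreate(&st[s]));
        CHECK(cudaMallocHost((void**)&hA[s], bytes));  // pinned host memory
        CHECK(cudaMallocHost((void**)&hB[s], bytes));
        CHECK(cudaMallocHost((void**)&hC[s], bytes));
        CHECK(cudaMalloc((void**)&dA[s], bytes));
        CHECK(cudaMalloc((void**)&dB[s], bytes));
        CHECK(cudaMalloc((void**)&dC[s], bytes));
        for (int i = 0; i < kN; ++i) {
            hA[s][i] = 1.0f; hB[s][i] = (float)s; hC[s][i] = -1.0f;
        }
    }

    if (!breadthFirst) {
        // Depth-first: issue the whole sequence for one stream, then the next.
        for (int s = 0; s < kStreams; ++s) {
            CHECK(cudaMemcpyAsync(dA[s], hA[s], bytes, cudaMemcpyHostToDevice, st[s]));
            CHECK(cudaMemcpyAsync(dB[s], hB[s], bytes, cudaMemcpyHostToDevice, st[s]));
            addKernel<<<blocks, threads, 0, st[s]>>>(dA[s], dB[s], dC[s], kN);
            CHECK(cudaMemcpyAsync(hC[s], dC[s], bytes, cudaMemcpyDeviceToHost, st[s]));
        }
    } else {
        // Breadth-first: issue each stage on all streams before the next stage.
        for (int s = 0; s < kStreams; ++s)
            CHECK(cudaMemcpyAsync(dA[s], hA[s], bytes, cudaMemcpyHostToDevice, st[s]));
        for (int s = 0; s < kStreams; ++s)
            CHECK(cudaMemcpyAsync(dB[s], hB[s], bytes, cudaMemcpyHostToDevice, st[s]));
        for (int s = 0; s < kStreams; ++s)
            addKernel<<<blocks, threads, 0, st[s]>>>(dA[s], dB[s], dC[s], kN);
        for (int s = 0; s < kStreams; ++s)
            CHECK(cudaMemcpyAsync(hC[s], dC[s], bytes, cudaMemcpyDeviceToHost, st[s]));
    }
    CHECK(cudaGetLastError());       // reports a kernel launch error, if any
    CHECK(cudaDeviceSynchronize());  // wait for all streams to finish

    bool ok = true;
    for (int s = 0; s < kStreams; ++s) {
        for (int i = 0; i < kN; ++i) ok = ok && (hC[s][i] == 1.0f + (float)s);
        CHECK(cudaFreeHost(hA[s])); CHECK(cudaFreeHost(hB[s])); CHECK(cudaFreeHost(hC[s]));
        CHECK(cudaFree(dA[s])); CHECK(cudaFree(dB[s])); CHECK(cudaFree(dC[s]));
        CHECK(cudaStreamDestroy(st[s]));
    }
    return ok;
}

int main() {
    std::printf("depth-first:   %s\n", runPipeline(false) ? "ok" : "MISMATCH");
    std::printf("breadth-first: %s\n", runPipeline(true) ? "ok" : "MISMATCH");
    return 0;
}
```

As far as I understand it, on a device with a single copy engine the depth-first order places each stream's DtoH ahead of the next stream's HtoD in the same copy queue, so some serialization is expected there. The breadth-first order is the one that would normally be expected to show copies overlapping with kernels.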

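I have also tried the simpleStreams example from the CUDA SDK, and I see the same behavior.

I attach some screenshots showing the problem in the Visual Profiler and in Nsight for VS2008.

P.S. I do not have the CUDA_LAUNCH_BLOCKING environment variable set.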

[Screenshot: simpleStreams in the Visual Profiler]

[Screenshot: MyApp Nsight timeline, breadth-first]

[Screenshot: MyApp Nsight timeline, depth-first]

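Edit

After adding 4 extra kernels (for a total of 2 HtoD, 5 kernels and 1 DtoH per stream): if I run nvprof with and without --concurrent-kernels-off, the elapsed time is the same. If I set the environment variable CUDA_LAUNCH_BLOCKING=1, I see (from the command line) a 7.5% performance improvement!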
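One way to make such elapsed-time comparisons repeatable is to time the batch with CUDA events instead of a wall clock. The sketch below is only an illustration under assumptions: `spinKernel`, `timeBatchMs`, the iteration count and the buffer size are invented for this example, and it times kernels only (no copies). The events are recorded on the legacy default stream, which synchronizes with streams created by `cudaStreamCreate`, so the measured interval should cover the whole batch.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                            \
    do {                                                                       \
        cudaError_t err_ = (call);                                             \
        if (err_ != cudaSuccess) {                                             \
            std::fprintf(stderr, "%s: %s\n", #call, cudaGetErrorString(err_)); \
            std::exit(EXIT_FAILURE);                                           \
        }                                                                      \
    } while (0)

const int kStreams = 4;           // four streams, as in the question
const int kKernelsPerStream = 5;  // five kernels per stream, as in the edit
const int kN = 256 * 8;           // small buffer so kernels can run side by side

// Placeholder kernel that just burns some time per element.
__global__ void spinKernel(float* x, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = x[i];
    for (int k = 0; k < iters; ++k) v = v * 1.0001f + 0.5f;
    x[i] = v;
}

// Launches the kernel batch on all streams and returns the elapsed time in ms.
float timeBatchMs() {
    cudaStream_t st[kStreams];
    float* d[kStreams];
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    for (int s = 0; s < kStreams; ++s) {
        CHECK(cudaStreamCreate(&st[s]));
        CHECK(cudaMalloc((void**)&d[s], kN * sizeof(float)));
        CHECK(cudaMemset(d[s], 0, kN * sizeof(float)));
    }

    CHECK(cudaEventRecord(start, 0));  // legacy default stream
    for (int s = 0; s < kStreams; ++s)
        for (int k = 0; k < kKernelsPerStream; ++k)
            spinKernel<<<kN / 256, 256, 0, st[s]>>>(d[s], kN, 100000);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(stop, 0));   // waits for the work issued on the other streams
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));

    for (int s = 0; s < kStreams; ++s) {
        CHECK(cudaFree(d[s]));
        CHECK(cudaStreamDestroy(st[s]));
    }
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return ms;
}

int main() {
    std::printf("batch time: %.3f ms\n", timeBatchMs());
    return 0;
}
```

Running the same binary once normally and once with CUDA_LAUNCH_BLOCKING=1 set in the environment gives two numbers that can be compared directly. The first launch in a process also pays one-time initialization costs, so it may be worth calling `timeBatchMs()` twice and discarding the first result.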

System specs:

  • Windows 7
  • NVIDIA 6800 VGA in the first PCI-E slot
  • GTX480 in the second PCI-E slot
  • NVIDIA driver: 306.94
  • Visual Studio 2008
  • CUDA v5.0
  • Visual Profiler 5.0
  • Nsight 3.0

3 Answers


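As I said in the comments, there is indeed a bug in the CUDA driver that prevents streams from working with my setup. I have tested a compute capability 1.1 card (8800 GTS) and a compute capability 3.5 card (GTX Titan), and both work fine. It seems that some Fermi cards have the problem (my GTX 480 does not work).

answered 2013-05-26T11:49:28.323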

I just ran into the same problem. I agree with you that there is a bug. I think the bug is either in the CUDA driver for Windows or in Windows itself. I have tested my code on Linux, and it works well there (with overlap).

In fact, you can test the "simpleStreams" example in the SDK. I found that on Windows "simpleStreams" shows no overlap at all between kernel execution and memory copies, whereas on Linux it works perfectly.

I am using CUDA 5.0 and a Fermi GTX 570. Given your tests on the 8800 GTS and the GTX Titan, I would agree that it is a bug in the CUDA driver for Windows. Hopefully it will be fixed soon.

answered 2013-05-28T02:24:59.223

TL;DR: The issue is caused by the WDDM TDR delay option in Nsight Monitor! When the "enabled" option is set to false, the issue appears. If you instead set it to true and set the TDR delay to a very high value, the issue goes away. Please also try the (more common) options described below, because they are related to the problem too!

Read on for the older steps I followed before arriving at the solution above, and for some other possible causes.

I was recently able to partially solve this problem! I think it is specific to Windows and Aero. Please try these steps and post your results to help others! I have tried them on a GTX 650 and a GT 640.

Before you do anything, consider using the onboard GPU (for display) together with the discrete GPU (for computation), because there are verified issues with the NVIDIA driver for Windows! When the onboard GPU drives the display, the NVIDIA driver does not get fully loaded, so many bugs are avoided. System responsiveness is also maintained while you work!

  1. Make sure your concurrency problem is not caused by other issues such as old drivers (including the BIOS), incorrect code, an incapable device, etc.
  2. Go to Computer > Properties.
  3. Select "Advanced system settings" on the left side.
  4. Go to the Advanced tab.
  5. Under Performance, click Settings.
  6. In the Visual Effects tab, select the "Adjust for best performance" option.

This will disable Aero and almost all visual effects. If this configuration works, you can re-enable the visual-effect checkboxes one by one until you find the precise one that causes the problem!

Alternatively, you can:

  1. Right-click on the desktop and select Personalize.
  2. Select one of the basic themes, which do not use Aero.

This has the same effect as the steps above, but leaves more visual options enabled. This setting also works for my two devices, so I kept it.

Please, when you try these solutions, come back here and post your findings!

For me, it solved the problem in most cases (a tiled DGEMM I wrote), but NOTE THAT I still can't run "simpleStreams" properly and achieve concurrency...

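Update: The problem was completely solved by a fresh Windows installation! The previous steps improved the behavior in some cases, but only a fresh install solved all of the problems!

I will try to find a less drastic way to solve this; maybe just restoring the registry will be enough.

answered 2015-03-18T08:31:49.307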