I cannot get memcpy and kernel execution to overlap even with the simpleStreams example from the CUDA SDK, let alone in my own programs. These threads argue that it is a problem with the WDDM driver on Windows:
- Why it is not possible to overlap memHtoD with GPU kernel with GTX 590,
- CUDA kernels not launching before CudaDeviceSynchronize
- Time between Kernel Launch and Kernel Execution
and suggest the following:
- flush the WDDM queue with cudaEventQuery() (does not work),
- submit streams in a breadth-first manner (does not work).
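For reference, the two suggested workarounds can be sketched roughly as below. This is a minimal sketch, not the simpleStreams code: the kernel addOne, the stream count, and the chunk size are made up for illustration. Work is issued breadth-first (all host-to-device copies, then all kernels, then all device-to-host copies), and after each stage an event is recorded in every stream and polled with cudaEventQuery(), which is the call reported to nudge the WDDM driver into submitting its queued commands. Whether any overlap actually results depends on the driver and the GPU; in my case it did not help.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err_));                    \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Hypothetical workload: add 1.0f to every element of a chunk.
__global__ void addOne(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main()
{
    const int nStreams = 4;        // assumed values, chosen for illustration
    const int chunk = 1 << 20;     // elements per stream
    const int n = nStreams * chunk;
    const size_t chunkBytes = chunk * sizeof(float);
    const int threads = 256;
    const int blocks = (chunk + threads - 1) / threads;

    float *h = nullptr, *d = nullptr;
    // Pinned host memory is needed for cudaMemcpyAsync to run asynchronously.
    CHECK(cudaMallocHost(&h, n * sizeof(float)));
    CHECK(cudaMalloc(&d, n * sizeof(float)));
    for (int i = 0; i < n; ++i) h[i] = 0.0f;

    cudaStream_t stream[nStreams];
    cudaEvent_t flush[nStreams];
    for (int s = 0; s < nStreams; ++s) {
        CHECK(cudaStreamCreate(&stream[s]));
        CHECK(cudaEventCreateWithFlags(&flush[s], cudaEventDisableTiming));
    }

    // Stage 1 (breadth-first): one host-to-device copy per stream.
    for (int s = 0; s < nStreams; ++s) {
        CHECK(cudaMemcpyAsync(d + s * chunk, h + s * chunk, chunkBytes,
                              cudaMemcpyHostToDevice, stream[s]));
        CHECK(cudaEventRecord(flush[s], stream[s]));
        // Return value ignored on purpose: cudaErrorNotReady is expected here.
        cudaEventQuery(flush[s]);
    }

    // Stage 2: one kernel per stream.
    for (int s = 0; s < nStreams; ++s) {
        addOne<<<blocks, threads, 0, stream[s]>>>(d + s * chunk, chunk);
        CHECK(cudaEventRecord(flush[s], stream[s]));
        cudaEventQuery(flush[s]);
    }

    // Stage 3: one device-to-host copy per stream.
    for (int s = 0; s < nStreams; ++s) {
        CHECK(cudaMemcpyAsync(h + s * chunk, d + s * chunk, chunkBytes,
                              cudaMemcpyDeviceToHost, stream[s]));
        CHECK(cudaEventRecord(flush[s], stream[s]));
        cudaEventQuery(flush[s]);
    }

    CHECK(cudaDeviceSynchronize());

    // Verify the result so the sketch is at least functionally checked.
    int errors = 0;
    for (int i = 0; i < n; ++i)
        if (h[i] != 1.0f) ++errors;
    printf("mismatches: %d\n", errors);

    for (int s = 0; s < nStreams; ++s) {
        CHECK(cudaEventDestroy(flush[s]));
        CHECK(cudaStreamDestroy(stream[s]));
    }
    CHECK(cudaFree(d));
    CHECK(cudaFreeHost(h));
    return errors == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

The program only checks correctness; to see whether the copies and kernels really overlap, the timeline has to be inspected in a profiler.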
This thread argues it is a bug in Fermi:
This thread:
proposes a solution to mitigate the problems with WDDM on Windows. However, it only works for Tesla cards, and it requires an additional video card to drive the display, since the proposed drivers are compute-only drivers.
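To see which situation a given machine is in, the device can be queried at runtime. The following is a small sketch under the assumption that device 0 is the GPU of interest; it prints the compute capability, the number of asynchronous copy engines, and whether the device is running under the compute-only (TCC) driver, using cudaDeviceGetAttribute().

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Read one integer device attribute, aborting on failure.
static int getAttr(cudaDeviceAttr attr, int device)
{
    int value = 0;
    cudaError_t err = cudaDeviceGetAttribute(&value, attr, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceGetAttribute: %s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
    return value;
}

int main()
{
    const int device = 0;  // assumption: the first device is the one of interest

    int major   = getAttr(cudaDevAttrComputeCapabilityMajor, device);
    int minor   = getAttr(cudaDevAttrComputeCapabilityMinor, device);
    int engines = getAttr(cudaDevAttrAsyncEngineCount, device);
    int tcc     = getAttr(cudaDevAttrTccDriver, device);

    printf("compute capability : %d.%d\n", major, minor);
    // 0 means copies cannot overlap with kernels at all; 2 additionally
    // allows copies in both directions to run at the same time.
    printf("async copy engines : %d\n", engines);
    printf("TCC driver         : %s\n", tcc ? "yes" : "no");
    return EXIT_SUCCESS;
}
```

A nonzero engine count only says that the hardware can overlap copies with kernels; it says nothing about whether the driver batches the commands in a way that allows it.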
However, none of these threads provides a real solution. I would appreciate it if NVIDIA could comment on this problem and come up with a solution, since apparently a lot of people are experiencing it.