I am trying to understand how CUDA unified memory works. I have read some beginner-oriented blog posts on CUDA unified memory, and I wrote the code given below:
#include <iostream>

__global__ void transfer(int *X)
{
    X[threadIdx.x] = X[threadIdx.x] + 3;
}

int main()
{
    int *x;
    size_t free_bytes, total_bytes;

    cudaMemGetInfo(&free_bytes, &total_bytes);
    std::cout << "Before cudaMallocManaged: " << "free: " << free_bytes << " total: " << total_bytes << '\n';

    cudaMallocManaged(&x, sizeof(int) * 512);
    cudaMemGetInfo(&free_bytes, &total_bytes);
    std::cout << "After cudaMallocManaged and Before Prefetch to GPU: " << "free: " << free_bytes << " total: " << total_bytes << '\n';

    // Prefetch the range to the GPU (device 0); the printed value is the returned cudaError_t.
    std::cout << cudaMemPrefetchAsync(x, sizeof(int) * 512, 0);
    cudaMemset(x, 0, sizeof(int) * 512);
    cudaDeviceSynchronize();
    cudaMemGetInfo(&free_bytes, &total_bytes);
    std::cout << "\nAfter Prefetch to GPU Before Kernel call: " << "free: " << free_bytes << " total: " << total_bytes << '\n';

    transfer<<<1, 512>>>(x);
    cudaMemGetInfo(&free_bytes, &total_bytes);
    std::cout << "After Kernel call Before memAdvise: " << "free: " << free_bytes << " total: " << total_bytes << '\n';

    cudaMemAdvise(x, sizeof(int) * 512, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
    cudaMemGetInfo(&free_bytes, &total_bytes);
    std::cout << "After memAdvise Before Prefetch to CPU: " << "free: " << free_bytes << " total: " << total_bytes << '\n';

    // Prefetch the range back to the CPU; again the return code is printed.
    std::cout << cudaMemPrefetchAsync(x, sizeof(int) * 512, cudaCpuDeviceId);
    cudaDeviceSynchronize();
    cudaMemGetInfo(&free_bytes, &total_bytes);
    std::cout << "\nAfter Prefetch Before processing in CPU: " << "free: " << free_bytes << " total: " << total_bytes << '\n';

    for (int i = 0; i < 512; i++)
    {
        x[i] = x[i] + 1;
        std::cout << x[i];
    }
    cudaMemGetInfo(&free_bytes, &total_bytes);
    std::cout << "\nAfter processing in CPU Before free: " << "free: " << free_bytes << " total: " << total_bytes << '\n';

    cudaFree(x);
    cudaMemGetInfo(&free_bytes, &total_bytes);
    std::cout << "After free: " << "free: " << free_bytes << " total: " << total_bytes << '\n';
    return 0;
}
Output:
Before cudaMallocManaged: free: 16804216832 total: 17071734784
After cudaMallocManaged and Before Prefetch to GPU: free: 16804216832 total: 17071734784
0
After Prefetch to GPU Before Kernel call: free: 16669999104 total: 17071734784
After Kernel call Before memAdvise: free: 16669999104 total: 17071734784
After memAdvise Before Prefetch to CPU: free: 16669999104 total: 17071734784
0
After Prefetch Before processing in CPU: free: 16669999104 total: 17071734784
44444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444
After processing in CPU Before free: free: 16669999104 total: 17071734784
After free: free: 16674193408 total: 17071734784
我在提供 16 GB Tesla P100 PCIe GPU 的 Kaggle 上运行代码。x
我有一个使用分配的整数数组cudaMallocManaged()
。首先,我在 GPU 中预取数组并对其进行一些处理,然后将其预取到 CPU 并进行一些处理。在这两者之间,我打印了内存传输前后 GPU 上可用的空闲内存。基于此,我有两个问题:
在空闲内存减少后
cudaMallocManaged()
的第一次预取期间,比我分配的要多得多。为什么?预取到 CPU 前后的空闲内存是一样的。此外,当我访问和修改 CPU 上的数组时,GPU 上的可用内存在此之前和之后仍然保持不变。我不明白为什么会这样。在预取/处理 CPU 上的统一内存位置时,GPU 上的相应页面不应该被驱逐并移动到 CPU,这不应该释放 GPU 内存吗?