I'm trying to print float values from a kernel using printf. I'm doing this to debug another program I'm working on, which copies float arrays from host to device. I wrote a kernel to check the values stored in the float array on the device, but it only printed 0.

So I wrote this code to check:

#include <stdio.h>
#include <stdlib.h>  // malloc, free

#define ARR_LENGTH 3

__global__ void checkArr(float* arr);

int main(void)  
{  
    float* arr = (float*) malloc(sizeof(float) * ARR_LENGTH);

    float cont = 0;
    for(int i = 0 ; i < ARR_LENGTH ; i++) {
        arr[i] = cont;
        cont++;
    }

    for(int i = 0 ; i < ARR_LENGTH ; i++) {
        printf("arr[%d] : %f\n", i , arr[i]);
    }


    float* d_arr;
    cudaMalloc((void**) &d_arr, sizeof(float) * ARR_LENGTH);
    cudaMemcpy(d_arr, arr, sizeof(float) * ARR_LENGTH, cudaMemcpyHostToDevice);

    printf("got here\n");

    checkArr<<<1,1>>>(d_arr);

    printf("got here\n");

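    // Copy the array back to the host to confirm the device copy is intact.
    float* check = (float*) malloc(sizeof(float) * ARR_LENGTH);
    cudaMemcpy(check, d_arr, sizeof(float) * ARR_LENGTH, cudaMemcpyDeviceToHost);
    for(int i = 0 ; i < ARR_LENGTH ; i++) {
        printf("arr[%d] : %f\n", i , check[i]);
    }

    // Release host and device memory.
    free(arr);
    free(check);
    cudaFree(d_arr);
    return 0;
}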

__global__ void checkArr(float* arr ) 
{
    float check = 5.0;
    printf("float check : %f\n", check);
    printf("float check : %f\n", check + 1.0);
    printf("float check : %f\n", check + 2.0);

    for(int i = 0 ; i < ARR_LENGTH ; i++) {
        printf("arr[%d] : %f\n", i , arr[i]);
    }
}

with the output:

arr[0] : 0.000000
arr[1] : 1.000000
arr[2] : 2.000000
got here
float check : 0
float check : 0
float check : 0
arr[0] : 2.4375
arr[1] : 2.4375
arr[2] : 2.4375
got here
arr[0] : 0.000000
arr[1] : 1.000000
arr[2] : 2.000000

If I don't print the 'float check' lines before printing the array, the array values all print as 0. It's kind of weird. Any explanation? Does it mean I can't inspect float values in device memory? (As you can see, int seems to be returned fine.)

I compile the program with -arch=sm_20. As I don't have a CUDA-compatible device at home, I compiled and ran the check using GPUOcelot. Can you reproduce this error with a compatible device?

Cheers, AErlaut

1 Answer

When I compile and run your code on an actual sm_20 GPU (M2090), I get the following output:

$ ./t97
arr[0] : 0.000000
arr[1] : 1.000000
arr[2] : 2.000000
got here
got here
float check : 5.000000
float check : 6.000000
float check : 7.000000
arr[0] : 0.000000
arr[1] : 1.000000
arr[2] : 2.000000
arr[0] : 0.000000
arr[1] : 1.000000
arr[2] : 2.000000
$

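Note that on a real device, printf from a kernel is somewhat asynchronous with respect to the printf queue from the host, so the output may appear in a different order.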
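
To illustrate the ordering point, here is a minimal sketch, not taken from the question's code. The kernel name helloFromDevice and the messages are hypothetical. It assumes a real CUDA device, where device-side printf output is buffered and flushed at synchronization points such as cudaDeviceSynchronize() or a blocking cudaMemcpy. In the question's program, the blocking device-to-host cudaMemcpy is most likely what flushes the kernel output, which would explain why both "got here" lines appear first on the M2090.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread prints one line from the device.
__global__ void helloFromDevice()
{
    printf("device: thread %d\n", threadIdx.x);
}

int main()
{
    printf("host: before launch\n");

    // The launch is asynchronous: control returns to the host immediately.
    helloFromDevice<<<1, 2>>>();

    // On a real GPU the kernel's output has usually not been flushed yet here.
    printf("host: after launch\n");

    // Wait for the kernel to finish; this also flushes the device printf buffer.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("host: after synchronize\n");
    return 0;
}
```

If the kernel's output needs to appear before a particular host printf, placing a cudaDeviceSynchronize() call between the launch and that printf is one way to get that ordering. An emulator such as Ocelot may flush at different points, so this ordering is only an assumption about real hardware.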

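My point is to suggest to you that GPU behavior and Ocelot behavior may differ. If you continue to post "please check my Ocelot program on a real GPU" questions, I won't be responding to them.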

Answered 2013-03-23T15:49:04.447