I have the following classes:

class device_list; // forward declaration, needed by host_list::operator=

class host_list{
    host_vector<int> id;
    host_vector<int> weight;
    /*...irrelevant functions and variables...*/
    host_list& operator= (const device_list& TheOther);
};

class device_list{
    device_vector<int> id;
    device_vector<int> weight;
    /*...irrelevant functions and variables...*/
    device_list& operator= (const host_list& TheOther);
};

and my functions:

void FillSampleData(host_list& dest); // just fills the two vectors with
                // 40 elements each, runs on the CPU
void EvaluateData(host_list& input, int* result);

int main(void){
    host_list input;
    int result[5]={0};

    FillSampleData(input);

    EvaluateData(input,result);
    /*...etc...*/
}

void EvaluateData(host_list& input, int* result){
    device_list d_list;

    cudaDeviceSynchronize(); // [1]
    d_list=input;            // [2]
    /*...etc...*/
}

At first I thought there was an error somewhere, since copying the input vectors to the GPU [2] took approx. 2 minutes. After some searching I found that writing to GPU global memory has to wait for all pending kernel launches to finish, so I added [1] just to see what happens.

As a result, [2] now runs as fast as it should, but the synchronization line [1] takes over 2 minutes.
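
One way to time each step on its own is sketched below, assuming plain wall-clock timing with std::chrono is precise enough for delays of this size. elapsed_ms is a hypothetical helper, not part of the code above, and the CUDA calls appear only in comments so that the snippet compiles without the toolkit.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper (not in the original code): runs `step` once and
// returns the wall-clock time it took, in milliseconds.
template <typename Step>
double elapsed_ms(Step&& step) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Step>(step)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Assumed usage inside EvaluateData (needs CUDA/Thrust, so it is not compiled here):
//   double t_sync = elapsed_ms([]  { cudaDeviceSynchronize(); }); // step [1]
//   double t_copy = elapsed_ms([&] { d_list = input; });          // step [2]
```

Comparing the two values with and without line [1] present shows which call absorbs the delay.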

Can someone tell me where the hidden kernel call is that the code is waiting for, or what else I am missing? (I was thinking about some kind of pre-initialization, but I have never had to do anything like that before, so I doubt that is it.)
