I have the following classes:
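// host_vector / device_vector are assumed to be Thrust's containers
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
using thrust::host_vector;
using thrust::device_vector;

class device_list;  // forward declaration, needed by host_list::operator=

class host_list {
    host_vector<int> id;
    host_vector<int> weight;
    /*...irrelevant functions and variables...*/
    host_list& operator=(const device_list& TheOther);  // device -> host copy
};

class device_list {
    device_vector<int> id;
    device_vector<int> weight;
    /*...irrelevant functions and variables...*/
    device_list& operator=(const host_list& TheOther);  // host -> device copy
};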
and my functions:
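void FillSampleData(host_list& dest);  // just fills the two vectors with
                                       // 40 elements each, runs on CPU

// Declared before main so the call below compiles. The result parameter is
// a plain int*, since an int*& cannot bind to the array `result`.
void EvaluateData(host_list& input, int* result);

int main(void) {
    host_list input;
    int result[5] = {0};
    FillSampleData(input);
    EvaluateData(input, result);
    /*...etc...*/
}

void EvaluateData(host_list& input, int* result) {
    device_list d_list;
    cudaDeviceSynchronize();  // [1]
    d_list = input;           // [2]
    /*...etc...*/
}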
At first I thought there was an error somewhere, since copying the input vectors to the GPU [2] took approx. 2 minutes. After a bit of searching I found that writing to GPU global memory has to wait for all kernel launches to finish, so I added [1] just to see what happens.
As a result, [2] now runs as fast as it should, but the synchronization line [1] takes over 2 minutes.
Can someone tell me where the hidden kernel call is that the code is waiting for, or what else I am missing? (I was thinking about some kind of pre-initialization, but I have never had to do anything like that before, so I doubt that is it.)
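For reference, here is a minimal sketch of how the startup cost could be measured separately from the copy, to check the pre-initialization idea. The names time_ms and ReportStartupCost are hypothetical helpers, not part of CUDA or Thrust. The sketch assumes the first CUDA runtime call pays for lazy context setup, and uses cudaFree(nullptr) as a commonly used no-op to trigger it. I have not confirmed that this is where the time goes in my program.

```cpp
#include <chrono>
#include <cstdio>
#include <utility>

// Runs `step` once and returns the elapsed wall-clock time in milliseconds.
// time_ms is a hypothetical helper, not part of CUDA or Thrust.
template <typename Step>
double time_ms(Step&& step) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Step>(step)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

#ifdef __CUDACC__  // this part is only compiled when building with nvcc
#include <cuda_runtime.h>

// Hypothetical helper: call it first thing in main(), before any other
// CUDA or Thrust work.
void ReportStartupCost() {
    // Assumption: the first runtime call triggers the lazy context setup,
    // so its duration approximates the one-time initialization cost.
    const double init_ms = time_ms([] { cudaFree(nullptr); });

    // No kernel has been launched at this point, so if the assumption holds
    // this synchronization should have nothing left to wait for.
    const double sync_ms = time_ms([] { cudaDeviceSynchronize(); });

    std::printf("first CUDA call: %.1f ms, sync afterwards: %.1f ms\n",
                init_ms, sync_ms);
}
#endif
```

Calling ReportStartupCost() at the top of main, before FillSampleData, would show whether the 2 minutes move to the first CUDA call, which would point to initialization instead of a hidden kernel.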