I have a simulation application that I have written in both C and CUDA. To measure the speedup, I recorded the time in both cases; in CUDA I used CUDA events, and I compute the speedup by dividing the CPU time by the GPU time (as is usually done). The image of the speedup graph is provided below.
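For reference, here is roughly how I take the GPU measurement (a minimal sketch: the kernel body, sizes, and names are placeholders for my actual simulation code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for the real simulation kernel; the body is a placeholder.
__global__ void simulationKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 0.5f + 1.0f;  // dummy work
}

int main(void)
{
    const int n = 1 << 20;            // total number of threads (placeholder size)
    const int threadsPerBlock = 1024; // maximum per block on this device
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    simulationKernel<<<blocks, threadsPerBlock>>>(d_data, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);       // wait until the kernel has finished

    float gpuMs = 0.0f;
    cudaEventElapsedTime(&gpuMs, start, stop);  // elapsed time in milliseconds
    printf("GPU time: %f ms\n", gpuMs);

    // speedup = cpuMs / gpuMs, where cpuMs is the timed C version

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```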
The weird thing about the speedup graph is that the speedup first increases to 55x, then drops to 35x, and then increases again as the total number of threads grows. I am not sure why this happens or how to figure out the reason behind such behaviour. I am using a GTX 560 Ti card with 448 cores. Each block has 1024 threads (the maximum), so only one block can be resident on each SM at a time. Is this happening because of occupancy issues, and how can I definitively determine the cause of this kind of speedup graph?
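One thing I thought of trying is to query the occupancy from the runtime itself. Would something like the sketch below be the right way to confirm it? (This assumes CUDA 6.5 or later for cudaOccupancyMaxActiveBlocksPerMultiprocessor; the stub kernel stands in for my real one.)

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stub with the same launch footprint as my real kernel.
__global__ void simulationKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Max threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);

    int blocksPerSM = 0;
    // Reports how many blocks of this kernel can be resident per SM
    // for a given block size and dynamic shared memory usage.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM,
                                                  simulationKernel,
                                                  1024 /* threads per block */,
                                                  0    /* dynamic smem */);

    float occupancy = (float)(blocksPerSM * 1024) / prop.maxThreadsPerMultiProcessor;
    printf("Resident blocks per SM: %d, theoretical occupancy: %.0f%%\n",
           blocksPerSM, 100.0f * occupancy);
    return 0;
}
```

If I understand correctly, a Fermi SM allows at most 1536 resident threads, so a single 1024-thread block leaves a third of the thread slots unused (about 67% theoretical occupancy), which is why I suspect occupancy in the first place.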