Variable in OpenCL kernel 'for-loop' reduces performance
I have a for-loop in my kernel. My code was hard-coded to iterate a fixed number of times:
    for (int kk = 0; kk < 50000; kk++) { <... my code here ...> }
I do not think the code inside the loop is relevant to my question; it is just very simple table look-ups and integer math.
I wanted to make my kernel code a bit more flexible, so I modified the loop so that the number of iterations (50000) is replaced by a kernel input parameter, 'num_loops'.
    for (int kk = 0; kk < num_loops; kk++) { <... my code here ...> }
What I found is that even when my host program calls the kernel with num_loops = 50000, which is exactly the same value as the previously hard-coded one, the performance of my kernel is cut roughly in half.
I am trying to figure out what is causing the performance hit. I assume it has something to do with the OpenCL compiler not being able to unroll the loop efficiently?
Is there a way to do what I am trying to do without incurring the penalty?
Update: Here are some results from playing with "#pragma unroll". It turns out that unrolling the loops does not seem to fix my performance problem.
Unrolling even degrades the performance of the hard-coded loop.
Here is the normal loop with the hard-coded value (best performance):
    for (int kk = 0; kk < 50000; kk++)       // Time to execute = 0.18 (40180 MI ops/sec)
If I unroll the loop, things get worse:
    #pragma unroll 50000                     // or just #pragma unroll
    for (int kk = 0; kk < 50000; kk++)       // Time to execute = 0.22 (33000 MI ops/sec)
Here is the loop that uses the variable, with num_loops = 50000:
    for (int kk = 0; kk < num_loops; kk++)   // Time to execute = 0.26 (27760 MI ops/sec)

    #pragma unroll 50000
    for (int kk = 0; kk < num_loops; kk++)   // Time to execute = 0.26 (27760 MI ops/sec)

    #pragma unroll
    for (int kk = 0; kk < num_loops; kk++)   // Time to execute = 0.24 (30280 MI ops/sec)
Using a plain "#pragma unroll" is slightly better when the num_loops variable is used, although performance is still about 25% slower than the hard-coded, non-unrolled version.
Any other ideas on how to use num_loops as my loop bound without the performance hit?
Yes, the most likely cause of the performance drop is that the compiler cannot unroll the loop when the trip count is not known at compile time. There are a few things you could try to improve the situation.
You could define the parameter as a preprocessor macro passed via your program build options. This is a common trick for building values that are only known at run-time into kernels as compile-time constants. For example:
    clBuildProgram(program, 1, &device, "-Dnum_loops=50000", NULL, NULL);
You could construct the build options dynamically using sprintf to make this more flexible. Obviously this is only worth it if you do not need to change the parameter often, so that the overhead of recompilation does not become an issue.
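As a minimal sketch of the dynamic build-options idea, the helper below formats the "-Dnum_loops=..." string with snprintf (the bounded variant of sprintf). The helper name make_build_options and the buffer size are assumptions made for illustration; only clBuildProgram and the -D option come from the answer above. Note that once num_loops is a macro, the kernel should no longer declare a parameter with that name, because the preprocessor would substitute the value into the parameter list as well.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not part of the OpenCL API): writes
 * "-Dnum_loops=<value>" into buf. Returns 0 on success, or -1 if the
 * text did not fit in the buffer or formatting failed. */
int make_build_options(char *buf, size_t size, int num_loops)
{
    assert(buf != NULL && size > 0);
    int written = snprintf(buf, size, "-Dnum_loops=%d", num_loops);
    return (written < 0 || (size_t)written >= size) ? -1 : 0;
}

/* Intended use on the host, assuming `program` and `device` already exist:
 *
 *     char options[64];
 *     if (make_build_options(options, sizeof options, 50000) == 0)
 *         clBuildProgram(program, 1, &device, options, NULL, NULL);
 */
```

Each distinct value of num_loops requires a rebuild of the program, so this fits best when the value changes rarely.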
You could investigate whether your OpenCL platform supports any pragmas that give the compiler hints about loop unrolling. For example, some OpenCL compilers recognise #pragma unroll (or similar). OpenCL 2.0 has an attribute for this: __attribute__((opencl_unroll_hint)).
You could manually unroll the loop. How this looks depends on what assumptions you can make about the num_loops parameter. For example, if you know (or can ensure) that it will always be a multiple of 4, then you could do something like this:
    for (int kk = 0; kk < num_loops;) {
        <... more code here ...>
        kk++;
        <... more code here ...>
        kk++;
        <... more code here ...>
        kk++;
        <... more code here ...>
        kk++;
    }
Even if you cannot make such assumptions, you should still be able to unroll manually, but it may require some extra work (for example, to finish off any remaining iterations).
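The remainder handling mentioned above can be sketched as follows, in plain C so it can be checked on the host. The function body() is a hypothetical stand-in for the real loop body, which the question does not show; run_plain and run_unrolled4 are likewise illustrative names. The unrolled version processes four iterations per pass and then finishes the leftover 0 to 3 iterations in a short tail loop, so it does not need num_loops to be a multiple of 4.

```c
/* Hypothetical stand-in for the real loop body (simple integer math). */
static unsigned body(unsigned acc, int kk)
{
    return acc * 31u + (unsigned)kk;
}

/* Reference version: one iteration per pass. */
unsigned run_plain(int num_loops)
{
    unsigned acc = 0;
    for (int kk = 0; kk < num_loops; kk++)
        acc = body(acc, kk);
    return acc;
}

/* Manually unrolled by 4, with a tail loop for the remainder. */
unsigned run_unrolled4(int num_loops)
{
    unsigned acc = 0;
    int kk = 0;
    /* Largest multiple of 4 that does not exceed num_loops. */
    int main_end = num_loops - (num_loops % 4);

    for (; kk < main_end; kk += 4) {
        acc = body(acc, kk);
        acc = body(acc, kk + 1);
        acc = body(acc, kk + 2);
        acc = body(acc, kk + 3);
    }
    /* Finish the remaining 0 to 3 iterations. */
    for (; kk < num_loops; kk++)
        acc = body(acc, kk);
    return acc;
}
```

The same loop structure carries over to a kernel with the real body in place of body(). Whether it actually helps depends on the compiler and device, so it is worth timing against the plain loop.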