PGI User Forum


the difference between kernels and parallel instructions

 
PGI User Forum Forum Index -> Accelerator Programming
Teslalady



Joined: 16 Mar 2012
Posts: 75

Posted: Mon Jun 11, 2012 6:12 am    Post subject: the difference between kernels and parallel instructions

Dear Mat,

I just want to know the difference between "kernels" and "parallel", so I wrote the code below:

Code:
#include<stdio.h>
#include<stdlib.h>
#include<accel.h>
#include<sys/time.h>   /* gettimeofday() and struct timeval */
#define N 1024
int main()
{
    struct timeval start, end;
    double timeuse;
    int i;
    int n = N;
    int x[N], y[N];
    for (i = 0; i < N; i++)
    {
        x[i] = rand() % 10;
        y[i] = rand() % 10;
    }
    gettimeofday(&start, NULL);
    /* OpenACC subarrays are [start:length], so [0:N] copies the whole array;
       [0:1023] would miss the last element, which the stencil reads */
    #pragma acc kernels copy(x[0:N], y[0:N])
    {
        for (i = 1; i < n - 1; i++)
        {
            x[i] = 0.5 * y[i] + 0.25 * (y[i-1] + y[i+1]);
        }
        for (i = 1; i < n - 1; i++)
        {
            y[i] = 0.5 * x[i] + 0.25 * (x[i-1] + x[i+1]);
        }
    }
    gettimeofday(&end, NULL);
    timeuse = 1000000 * (end.tv_sec - start.tv_sec) + end.tv_usec - start.tv_usec;
    timeuse /= 1000000;
    printf("kernels time used:%f\n", timeuse);

    gettimeofday(&start, NULL);
    #pragma acc parallel copy(x[0:N], y[0:N])
    {
        #pragma acc loop
        for (i = 1; i < n - 1; i++)
        {
            x[i] = 0.5 * y[i] + 0.25 * (y[i-1] + y[i+1]);
        }
        #pragma acc loop
        for (i = 1; i < n - 1; i++)
        {
            y[i] = 0.5 * x[i] + 0.25 * (x[i-1] + x[i+1]);
        }
    }
    gettimeofday(&end, NULL);
    timeuse = 1000000 * (end.tv_sec - start.tv_sec) + end.tv_usec - start.tv_usec;
    timeuse /= 1000000;
    printf("parallel time used:%f\n", timeuse);

    return 0;
}

The compiler output is as below:
main:
19, Generating copy(y[0:1023])
Generating copy(x[0:1023])
Generating compute capability 2.0 binary
21, Loop is parallelizable
Accelerator kernel generated
21, #pragma acc loop gang, vector(256) /* blockIdx.x threadIdx.x */
CC 2.0 : 10 registers; 4 shared, 44 constant, 0 local memory bytes
25, Loop is parallelizable
Accelerator kernel generated
25, #pragma acc loop gang, vector(256) /* blockIdx.x threadIdx.x */
CC 2.0 : 10 registers; 4 shared, 44 constant, 0 local memory bytes
36, Accelerator kernel generated
39, #pragma acc loop gang, vector(256) /* blockIdx.x threadIdx.x */
CC 2.0 : 12 registers; 0 shared, 48 constant, 0 local memory bytes
44, #pragma acc loop gang, vector(256) /* blockIdx.x threadIdx.x */
36, Generating copy(y[0:1023])
Generating copy(x[0:1023])
Generating compute capability 2.0 binary
39, Loop is parallelizable
44, Loop is parallelizable

For the "kernels" directive: CC 2.0 : 10 registers; 4 shared, 44 constant, 0 local memory bytes.
For the "parallel" directive: CC 2.0 : 12 registers; 0 shared, 48 constant, 0 local memory bytes.
I still can't figure out their difference.

Please kindly help me
mkcolg



Joined: 30 Jun 2004
Posts: 6218
Location: The Portland Group Inc.

Posted: Mon Jun 11, 2012 11:09 am

Hi Teslalady,

For "kernels" regions, you're assuming that the compiler knows how best to accelerate your code. The compiler will do its analysis, determine where to generate kernels, determine the loop schedule, and so on. With additional directives, you can override what the compiler does, or give it more information to use in its analysis.

For "parallel" regions, the compiler assumes that you know how best to accelerate your code. You need to determine where to put your kernels and how to divide up the schedules. Whatever you specify will be used, even if it could lead to wrong answers.

One current limitation of the "kernels" region is that it can only accelerate tightly nested loops. Because of this, the compiler will create two kernels for your code, one for each of the loops.

In a "parallel" region, a single kernel is created spanning the entire region. This allows you to create kernels out of non-tightly nested loops. Any serial code is executed redundantly by all threads, and loops are divided up among the threads in a gang. Hence, in your second section, only one kernel is generated, where each gang executes each loop as an inner parallel vector loop. However, your code has a potential race condition, since the second loop should not start before the first is completely finished. Since there isn't any global synchronization between gangs, some of the data shared between gangs may not be updated in time. Hence, I would suggest changing your code to use two "parallel" regions surrounded by a "data" region.
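A minimal sketch of that restructuring (using the same arrays and stencil as your post; the function name and the `restrict` qualifiers are my additions, and the `copy` lengths follow the OpenACC `[start:length]` convention):

```c
#define N 1024

/* Two "parallel" regions inside one "data" region: the data region keeps
   x and y resident on the device across both kernels, while the boundary
   between the two parallel regions provides the global synchronization
   that a single parallel region lacks -- the second sweep cannot start
   until the first kernel has completely finished. */
void jacobi_sweeps(int n, int *restrict x, int *restrict y)
{
    int i;
    #pragma acc data copy(x[0:n], y[0:n])
    {
        #pragma acc parallel loop
        for (i = 1; i < n - 1; i++)
            x[i] = 0.5 * y[i] + 0.25 * (y[i-1] + y[i+1]);

        /* First kernel is done before this one launches. */
        #pragma acc parallel loop
        for (i = 1; i < n - 1; i++)
            y[i] = 0.5 * x[i] + 0.25 * (x[i-1] + x[i+1]);
    }
}
```

Compilers without OpenACC support simply ignore the pragmas, so the function also runs correctly as serial C.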

Hope this helps,
Mat
Teslalady



Joined: 16 Mar 2012
Posts: 75

Posted: Tue Jun 12, 2012 8:48 am

Hi Mat, thanks very much. I'm clear now.