PGI User Forum

a 3 levels of loop

 
Teslalady



Joined: 16 Mar 2012
Posts: 75

Posted: Thu Sep 06, 2012 9:42 am    Post subject: a 3 levels of loop

Dear Mat,

I have code with three nested loops. I tried to use OpenACC to accelerate the outermost loop, as shown below.

Code:
#pragma acc kernels copy(l[:N*N],u[:N*N]) copyin(a[:N*N]) local(sum)
for(i=0; i<n-1; i++)
        {
               
                for(j=0; j<n; j++)
                {
                        if(j>i)
                        {
                                for(k=0,sum=0; k<n; k++)
                                {
                                        if(k != i)
                                        {
                                           sum += l[j][k]*u[k][i];
                                        }
                                }
                                l[j][i] = (float)((a[j][i]-sum)/u[i][i]);
                        }
                }

               
                for(j=0; j<n; j++)
                {
                        if(j>i)
                        {
                                for(k=0,sum=0; k<n; k++)
                                {
                                        if(k != i+1)
                                        {
                                          sum += l[i+1][k]*u[k][j];
                                        }
                                }
                                u[i+1][j] = (float)((a[i+1][j]-sum));
                        }
                }

        }
But I found that the result is not the same as with the CPU code. I also tried to accelerate the inner loop, but that failed.
Can you give me some suggestions?
mkcolg



Joined: 30 Jun 2004
Posts: 5952
Location: The Portland Group Inc.

Posted: Thu Sep 06, 2012 11:13 am    Post subject: a 3 levels of loop

Hi Sisiy,
Quote:
but i found the result is not the same with CPU code

Most likely it's because you put "sum" in a local clause. This makes it a global variable shared by all threads. Please remove this clause and try again. If that still doesn't fix it, please post or send me a reproducing example.
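
To illustrate the point about "sum": one way to make sure every iteration has its own accumulator is to declare it inside the loop body instead of listing it in a data clause. The following is only a minimal sketch of the first half of the posted loop nest under some assumptions: the function name update_l_column, the flattened row-major indexing and the 0:n*n array sections are illustrative and do not come from the original code.

```c
#include <assert.h>

/* Hypothetical helper: fill column i of l below the diagonal.
 * l, u and a are n x n matrices stored row-major in flat arrays
 * (an assumption; the posted code indexes them as 2D arrays). */
void update_l_column(int n, int i, float *l, const float *u, const float *a)
{
    assert(n > 0 && i >= 0 && i < n);

    #pragma acc parallel loop copy(l[0:n*n]) copyin(u[0:n*n], a[0:n*n])
    for (int j = i + 1; j < n; j++) {
        /* Declared inside the loop body, so each j iteration
         * gets a private copy rather than one shared variable. */
        float sum = 0.0f;

        #pragma acc loop seq
        for (int k = 0; k < n; k++) {
            if (k != i) {
                sum += l[j * n + k] * u[k * n + i];
            }
        }
        l[j * n + i] = (a[j * n + i] - sum) / u[i * n + i];
    }
}
```

Built without OpenACC support, the pragmas are ignored and the function runs serially, which gives a convenient CPU reference to compare against.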

Quote:
and I also try to accelerate the inner loop, but failed.
Do you mean the "j" or "k" loop? The "j" loop should accelerate, assuming the compiler hasn't found some dependency (as indicated in the compiler feedback messages from -Minfo=accel). The "k" loops won't accelerate due to the "if" statement. However, you could use the "parallel" model instead: collapse the "i" and "j" loops into a "gang loop" and then parallelize the "k" loops with a "vector loop". Something along the lines of:
Code:

#pragma acc data copy(l[:N*N],u[:N*N]) copyin(a[:N*N])
{
#pragma acc parallel
{
#pragma acc loop collapse(2) gang
for(i=0; i<n-1; i++)
        {
                for(j=0; j<n; j++)
                {
                        if(j>i)
                        {
                                sum=0;
// each vector lane accumulates a partial sum that is combined at loop exit
#pragma acc loop vector reduction(+:sum)
                                for(k=0; k<n; k++)
                                {
                                        if(k != i)
                                        {
                                           sum += l[j][k]*u[k][i];
                                        }
                                }
                                l[j][i] = (float)((a[j][i]-sum)/u[i][i]);
                        }
                }
        }
} // end first parallel region
#pragma acc parallel
{
#pragma acc loop collapse(2) gang
for(i=0; i<n-1; i++)
        {
                for(j=0; j<n; j++)
                {
                        if(j>i)
                        {
                                sum=0;
#pragma acc loop vector reduction(+:sum)
                                for(k=0; k<n; k++)
                                {
                                        if(k != i+1)
                                        {
                                          sum += l[i+1][k]*u[k][j];
                                        }
                                }
                                u[i+1][j] = (float)((a[i+1][j]-sum));
                        }
                }
        }
}  // end second parallel region
}  // end data region
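
If the "i" iterations turn out to depend on one another (in the original loop nest, each pass reads values of l and u written by earlier passes), a variant worth trying is to keep "i" as a sequential host loop inside the data region and apply the gang/vector split only to "j" and "k". The following is a rough sketch under that assumption, not a tested GPU implementation: the name lu_factor and the flattened row-major arrays are illustrative, and it assumes l starts as the identity matrix and u starts as zero except for its first row, which holds the first row of a.

```c
#include <assert.h>

/* Hypothetical helper: factor the n x n row-major matrix a into l and u.
 * Assumes l = identity and u = 0 except row 0 (= row 0 of a) on entry. */
void lu_factor(int n, const float *a, float *l, float *u)
{
    assert(n > 0);

    /* Keep the matrices on the device across all i iterations. */
    #pragma acc data copy(l[0:n*n], u[0:n*n]) copyin(a[0:n*n])
    {
        for (int i = 0; i < n - 1; i++) {   /* sequential: carries a dependence */

            /* Column i of l: one gang iteration per row j, vector reduction over k. */
            #pragma acc parallel loop gang
            for (int j = i + 1; j < n; j++) {
                float sum = 0.0f;
                #pragma acc loop vector reduction(+:sum)
                for (int k = 0; k < n; k++) {
                    if (k != i) {
                        sum += l[j * n + k] * u[k * n + i];
                    }
                }
                l[j * n + i] = (a[j * n + i] - sum) / u[i * n + i];
            }

            /* Row i+1 of u: one gang iteration per column j, vector reduction over k. */
            #pragma acc parallel loop gang
            for (int j = i + 1; j < n; j++) {
                float sum = 0.0f;
                #pragma acc loop vector reduction(+:sum)
                for (int k = 0; k < n; k++) {
                    if (k != i + 1) {
                        sum += l[(i + 1) * n + k] * u[k * n + j];
                    }
                }
                u[(i + 1) * n + j] = a[(i + 1) * n + j] - sum;
            }
        }
    }
}
```

Each parallel region is launched once per "i", so the launch overhead grows with n; the trade-off is that the row and column updates happen in the same order as in the CPU code.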


Hope this helps,
Mat
All times are GMT - 7 Hours