PGI User Forum
 SearchSearch   MemberlistMemberlist     RegisterRegister   ProfileProfile    Log inLog in 

Free OpenACC Webinar

Inlining with pragmas
Goto page 1, 2  Next
 
Post new topic   Reply to topic    PGI User Forum Forum Index -> Accelerator Programming
View previous topic :: View next topic  
Author Message
nmearl



Joined: 02 Aug 2013
Posts: 3

PostPosted: Tue Apr 22, 2014 12:02 pm    Post subject: Inlining with pragmas Reply with quote

Hey there,

Is it possible to inline functions that also have pragmas declared? It seems I can inline functions if the function itself is small and doesn't contain any loops.

In the case where the function contains loops and I try to inline it into a function that is already wrapped with pragmas, I get yelled at with "Complex loop carried dependence of..." and "Loop carried reuse of...".

If I add pragmas to the function I want to inline, I get "Accelerator restriction: function/procedure calls are not supported".

The function calls exist in the same file, and I'm simply adding -Minline:fund to the command line when I compile.

Thanks for your time.
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 6211
Location: The Portland Group Inc.

PostPosted: Tue Apr 22, 2014 12:32 pm    Post subject: Reply with quote

Have you tried using the OpenACC "Routine" pragma? It will allow you to call the function from the compute region so you don't need to inline anything.

For the dependency messages, try adding "-Msafeptr". It may be that the compiler can't tell if your pointers are aliased or not. This flag asserts that all pointers are disjoint.

Finally, without the "routine" directive, no you can't put pragma in the function to be inlined.

- Mat
Back to top
View user's profile
nmearl



Joined: 02 Aug 2013
Posts: 3

PostPosted: Tue Apr 22, 2014 11:19 pm    Post subject: Reply with quote

Thanks Mat,

I went ahead and just removed the pragmas from the inlined function for now. However, I'm getting a strange error that I'm not sure how to interpret. My input code (not stand-alone, but the only part that contains accelerator syntax)

Code:
void rhs(double r[][3], double v[][3], double ro[][3], double vo[][3],double * mass, double * eta, int N)
{
    // register int m;
    double rr[N][N][3],rp[3],temp,temp2,temp3;

    #pragma acc region copyin(N, mass[:N])
    {
        #pragma acc loop //for private(rp[:3])
        for (int j = 0; j < N; j++) {   
            #pragma acc loop
            for (int i = 0; i < N; i++) {
                if (i >= j) continue;
                rij(r,mass,eta,i,j,rp);
                temp = norm3(rp);

                // rp[0]*temp^3
                rr[i][j][0] = (1/temp)*rp[0];
                rr[i][j][1] = (1/temp)*rp[1];
                rr[i][j][2] = (1/temp)*rp[2];
            }
        }
    //}


    A3(ro[0],v[0]);
    vo[0][0] = 0; vo[0][1] = 0; vo[0][2] = 0;

    //pragma acc kernels copyin(mass[:N])
    //{
        #pragma acc loop
            ro[i][0] = v[i][0];//A3(ro[i],v[i]);
            vo[i][0] = (temp*mass[0])*rr[0][i][0];//A3(vo[i],(eta[i]/eta[i-1]*mass[0])*rr[0][i]);
            ro[i][1] = v[i][1];//A3(ro[i],v[i]);
            vo[i][1] = (temp*mass[0])*rr[0][i][1];
            ro[i][2] = v[i][2];//A3(ro[i],v[i]);
            vo[i][2] = (temp*mass[0])*rr[0][i][2];

            #pragma acc loop
            for (int j = 1; j < N; j++) {
                if (j > i-1) continue;
                vo[i][0] += (temp*mass[j])*rr[j][i][0];//A3P(vo[i],temp*mass[j]*rr[j][i]);
                vo[i][1] += (temp*mass[j])*rr[j][i][1];
                vo[i][2] += (temp*mass[j])*rr[j][i][2];
            }
            #pragma acc loop
            for (int j=i+1; j<=N-1; j++) {
                vo[i][0] += -(mass[j])*(rr[i][j][0]); //A3P(vo[i],-mass[j]*rr[i][j]);
                vo[i][1] += -(mass[j])*(rr[i][j][1]);
                vo[i][2] += -(mass[j])*(rr[i][j][2]);
            }
            #pragma acc loop
            for (int j=0; j <= i-1; j++) {
                #pragma acc loop
                for (int k = i+1; k <= N-1; k++) {
                    vo[i][0] += (mass[k]*mass[j]*temp3)*(rr[j][k][0]);//A3P(vo[i],mass[j]*mass[k]*temp*rr[j][k]);
                    vo[i][1] += (mass[k]*mass[j]*temp3)*(rr[j][k][1]);
                    vo[i][2] += (mass[k]*mass[j]*temp3)*(rr[j][k][2]);
                }
            }
        }
    }
}


And I'm getting this error

Code:
PGCC-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unexpected flow graph (source/n_body.cpp: 129)
rhs(double (*)[3], double (*)[3], double (*)[3], double (*)[3], double *, double *, int):
    129, Generating copyin(N)
         Generating copyin(mass[:N])
    131, Loop is parallelizable
    133, Loop is parallelizable
          84, Complex loop carried dependence of 'rp' prevents parallelization
              Loop carried reuse of 'rp' prevents parallelization
              Inner sequential loop scheduled on accelerator
         105, Complex loop carried dependence of 'rp' prevents parallelization
              Loop carried reuse of 'rp' prevents parallelization
              Inner sequential loop scheduled on accelerator
    147, Loop is parallelizable
         Accelerator scalar kernel generated
    153, Loop is parallelizable
    165, Loop is parallelizable
    172, Accelerator restriction: size of the GPU copy of 'rr' is unknown
         Loop is parallelizable
    178, Loop is parallelizable
    180, Accelerator restriction: size of the GPU copy of 'rr' is unknown
         Loop is parallelizable


Any ideas?

Thanks for your time.
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 6211
Location: The Portland Group Inc.

PostPosted: Wed Apr 23, 2014 10:41 am    Post subject: Reply with quote

Hi nmearl,

The "Unexpected flow graph" is a compiler error. Your example doesn't compile as is but after a few changes, I was able to reproduce the error and I've filed a problem report (TPR#20361). However, the code itself is bad and this is causing the unexpected compiler behavior.

I took the liberty of rewriting the code to what I think you wanted. It appears to me that you are missing a second "i" loop since you're accessing ro[i][0] when there's no "i" in scope. Second, reductions are only allowed on scalars so I replaced the intermediary vo accumulations with scalars. This will also help cut down on the number of memory access. I also added a data region and split the kernels into two separate regions. Keeping it as it was would mean you would have one outer gang region with several inner vector loops because of the scalar assignments to "vo" in the middle and the call to A3. Note that I did have to comment out all the calls.

Granted, I have no idea if these changes are correct or what you intended so please modify as needed.

Here's my modified version:

Code:
% cat test_fixed.c
void rhs(double r[][3], double v[][3], double ro[][3], double vo[][3],double * mass, double * eta, int N)
 {
     // register int m;
     double rr[N][N][3],rp[3],temp,temp2,temp3;
     double vo0,vo1,vo2;

     #pragma acc data copyin(N, mass[:N])
{
     #pragma acc kernels
     {
         #pragma acc loop //for private(rp[:3])
         for (int j = 0; j < N; j++) {
             #pragma acc loop
             for (int i = 0; i < N; i++) {
                 if (i >= j) continue;
               ///  rij(r,mass,eta,i,j,rp);
               ///  temp = norm3(rp);

                 // rp[0]*temp^3
                 rr[i][j][0] = (1/temp)*rp[0];
                 rr[i][j][1] = (1/temp)*rp[1];
                 rr[i][j][2] = (1/temp)*rp[2];
             }
         }
     }


///     A3(ro[0],v[0]);
     vo[0][0] = 0; vo[0][1] = 0; vo[0][2] = 0;
     //#pragma acc kernels copyin(mass[:N])
     //{
      #pragma acc kernels loop copy(vo[0:N][0:3], ro[0:N][0:3])
      for (int i = 0; i < N; i++) {
             ro[i][0] = v[i][0];//A3(ro[i],v[i]);
             vo0 = (temp*mass[0])*rr[0][i][0];//A3(vo[i],(eta[i]/eta[i-1]*mass[0])*rr[0][i]);
             ro[i][1] = v[i][1];//A3(ro[i],v[i]);
             vo1 = (temp*mass[0])*rr[0][i][1];
             ro[i][2] = v[i][2];//A3(ro[i],v[i]);
             vo2 = (temp*mass[0])*rr[0][i][2];

             #pragma acc loop reduction(+:vo0,vo1,vo2)
             for (int j = 1; j < N; j++) {
                 if (j > i-1) continue;
                 vo0 += (temp*mass[j])*rr[j][i][0];//A3P(vo[i],temp*mass[j]*rr[j][i]);
                 vo1 += (temp*mass[j])*rr[j][i][1];
                 vo2 += (temp*mass[j])*rr[j][i][2];
             }
             #pragma acc loop reduction(+:vo0,vo1,vo2)
             for (int j=i+1; j<=N-1; j++) {
                 vo0 += -(mass[j])*(rr[i][j][0]); //A3P(vo[i],-mass[j]*rr[i][j]);
                 vo1 += -(mass[j])*(rr[i][j][1]);
                 vo2 += -(mass[j])*(rr[i][j][2]);
             }
             #pragma acc loop reduction(+:vo0,vo1,vo2)
             for (int j=0; j <= i-1; j++) {
             #pragma acc loop reduction(+:vo0,vo1,vo2)
                 for (int k = i+1; k <= N-1; k++) {
                     vo0 += (mass[k]*mass[j]*temp3)*(rr[j][k][0]);//A3P(vo[i],mass[j]*mass[k]*temp*rr[j][k]);
                     vo1 += (mass[k]*mass[j]*temp3)*(rr[j][k][1]);
                     vo2 += (mass[k]*mass[j]*temp3)*(rr[j][k][2]);
                 }
             }
             vo[i][0] +=  vo0;
             vo[i][1] +=  vo1;
             vo[i][2] +=  vo2;
        }
    }
 }


% pgcc -c test_fixed.c -acc -Minfo=accel -Msafeptr -V14.3
rhs:
      7, Generating copyin(N)
         Generating copyin(mass[:N])
      9, Generating present_or_copyin(rr[:N][:N][:])
         Generating present_or_copyin(rp[:])
         Generating NVIDIA code
     12, Loop is parallelizable
     14, Loop is parallelizable
         Accelerator kernel generated
         12, #pragma acc loop gang /* blockIdx.y */
         14, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     32, Generating copy(vo[:N][:])
         Generating copy(ro[:N][:])
         Generating present_or_copyin(rr[:][:][:])
         Generating present_or_copyin(v[:N][:])
         Generating NVIDIA code
     33, Loop is parallelizable
         Accelerator kernel generated
         33, #pragma acc loop gang /* blockIdx.x */
         42, #pragma acc loop vector(128) /* threadIdx.x */
             Sum reduction generated for vo0
             Sum reduction generated for vo1
             Sum reduction generated for vo2
         49, #pragma acc loop vector(128) /* threadIdx.x */
             Sum reduction generated for vo0
             Sum reduction generated for vo1
             Sum reduction generated for vo2
         55, Sum reduction generated for vo0
             Sum reduction generated for vo1
             Sum reduction generated for vo2
         57, #pragma acc loop vector(128) /* threadIdx.x */
             Sum reduction generated for vo0
             Sum reduction generated for vo1
             Sum reduction generated for vo2
     42, Loop is parallelizable
     49, Loop is parallelizable
     55, Loop is parallelizable
     57, Loop is parallelizable


Here's the test case I submitted to reproduce the flow graph error:
Code:
% cat test.c
void rhs(double r[][3], double v[][3], double ro[][3], double vo[][3],double * mass, double * eta, int N)
 {
     // register int m;
     double rr[N][N][3],rp[3],temp,temp2,temp3;
     #pragma acc kernels copyin(N, mass[:N])
     {
         #pragma acc loop //for private(rp[:3])
         for (int j = 0; j < N; j++) {
             #pragma acc loop
             for (int i = 0; i < N; i++) {
                 if (i >= j) continue;
               //  rij(r,mass,eta,i,j,rp);
               //  temp = norm3(rp);

                 // rp[0]*temp^3
                 rr[i][j][0] = (1/temp)*rp[0];
                 rr[i][j][1] = (1/temp)*rp[1];
                 rr[i][j][2] = (1/temp)*rp[2];
             }
         }
     //}


//     A3(ro[0],v[0]);
     vo[0][0] = 0; vo[0][1] = 0; vo[0][2] = 0;
     //#pragma acc kernels copyin(mass[:N])
     //{
     // #pragma acc loop
      int i = 0;
             ro[i][0] = v[i][0];//A3(ro[i],v[i]);
             vo[i][0] = (temp*mass[0])*rr[0][i][0];//A3(vo[i],(eta[i]/eta[i-1]*mass[0])*rr[0][i]);
             ro[i][1] = v[i][1];//A3(ro[i],v[i]);
             vo[i][1] = (temp*mass[0])*rr[0][i][1];
             ro[i][2] = v[i][2];//A3(ro[i],v[i]);
             vo[i][2] = (temp*mass[0])*rr[0][i][2];

             #pragma acc loop
             for (int j = 1; j < N; j++) {
                 if (j > i-1) continue;
                 vo[i][0] += (temp*mass[j])*rr[j][i][0];//A3P(vo[i],temp*mass[j]*rr[j][i]);
                 vo[i][1] += (temp*mass[j])*rr[j][i][1];
                 vo[i][2] += (temp*mass[j])*rr[j][i][2];
             }
             #pragma acc loop
             for (int j=i+1; j<=N-1; j++) {
                 vo[i][0] += -(mass[j])*(rr[i][j][0]); //A3P(vo[i],-mass[j]*rr[i][j]);
                 vo[i][1] += -(mass[j])*(rr[i][j][1]);
                 vo[i][2] += -(mass[j])*(rr[i][j][2]);
             }
             #pragma acc loop
             for (int j=0; j <= i-1; j++) {
                 #pragma acc loop
                 for (int k = i+1; k <= N-1; k++) {
                     vo[i][0] += (mass[k]*mass[j]*temp3)*(rr[j][k][0]);//A3P(vo[i],mass[j]*mass[k]*temp*rr[j][k]);
                     vo[i][1] += (mass[k]*mass[j]*temp3)*(rr[j][k][1]);
                     vo[i][2] += (mass[k]*mass[j]*temp3)*(rr[j][k][2]);
                 }
             }
        }
//     }
 }


% pgcc -c test.c -acc -Minfo=accel -Msafeptr -V14.3
PGC-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unexpected flow graph (test.c: 5)
rhs:
      5, Generating copyin(N)
         Generating copyin(mass[:N])
      8, Loop is parallelizable
     10, Loop is parallelizable
     20, Accelerator scalar kernel generated
     38, Complex loop carried dependence of '*(vo)' prevents parallelization
         Parallelization requires privatization of '*(vo)' as well as last value
     45, Complex loop carried dependence of '*(vo)' prevents parallelization
         Parallelization requires privatization of '*(vo)' as well as last value
     51, Parallelization would require privatization of array 'vo[:i1+1][:]'
     53, Complex loop carried dependence of '*(vo)' prevents parallelization
         Loop carried dependence due to exposed use of 'vo[:i1+1][:]' prevents parallelization
PGC/x86-64 Linux 14.3-0: compilation completed with severe errors


Thanks,
Mat
Back to top
View user's profile
nmearl



Joined: 02 Aug 2013
Posts: 3

PostPosted: Wed Apr 23, 2014 11:45 am    Post subject: Reply with quote

Thanks Mat! You rock!

I'm not sure how that i-loop got misplaced, but there is one in the actual file right where you put it. Sorry for that oversight.

Should I be concerned if when I compile it, I don't get the same messages? I.e. I don't get notified of the reductions in the loops

Code:
rhs(double (*)[3], double (*)[3], double (*)[3], double (*)[3], double *, double *, int):
    131, Generating copyin(N)
         Generating copyin(mass[:N])
    133, Generating present_or_copyin(rr[:N])
         Generating NVIDIA code
    135, Loop is parallelizable
         Accelerator kernel generated
        135, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
    137, Loop carried reuse of 'rp' prevents parallelization
         Complex loop carried dependence of 'rp' prevents parallelization
         Complex loop carried dependence of '*(rr)' prevents parallelization
         Loop carried reuse of '*(rr)' prevents parallelization
          84, Complex loop carried dependence of 'rp' prevents parallelization
              Loop carried reuse of 'rp' prevents parallelization
              Inner sequential loop scheduled on accelerator
         105, Complex loop carried dependence of 'rp' prevents parallelization
              Loop carried reuse of 'rp' prevents parallelization
              Inner sequential loop scheduled on accelerator
    151, Generating copy(vo[:N][:])
         Generating copy(ro[:N][:])
         Generating present_or_copyin(rr[:N])
         Generating present_or_copyin(v[:N][:])
         Generating NVIDIA code
    155, Loop is parallelizable
         Accelerator kernel generated
        155, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
    167, Loop is parallelizable
    173, Accelerator restriction: size of the GPU copy of 'rr' is unknown
         Loop is parallelizable
    179, Loop is parallelizable
    181, Accelerator restriction: size of the GPU copy of 'rr' is unknown
         Loop is parallelizable


Also, that inner part with 84 and 105 is the reference to another function. I went ahead and tried to implement the acc routine, but the compiler tells me that

Code:
"source/n_body.cpp", line 65: warning: invalid text in pragma
  #pragma acc routine
              ^


My compiler is
Code:
pgc++ 14.3-0 64-bit target on x86-64 Linux -tp piledriver
The Portland Group - PGI Compilers and Tools
Copyright (c) 2014, NVIDIA CORPORATION.  All rights reserved.


Thanks again for your help,
Nick
Back to top
View user's profile
Display posts from previous:   
Post new topic   Reply to topic    PGI User Forum Forum Index -> Accelerator Programming All times are GMT - 7 Hours
Goto page 1, 2  Next
Page 1 of 2

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © phpBB Group