PGI User Forum

dependence in loop prevents parallelization

 
xray



Joined: 21 Jan 2010
Posts: 85

Posted: Fri Feb 05, 2010 1:08 am    Post subject: dependence in loop prevents parallelization

Hi,
I'm using PGI 10.1 on 64-bit Linux.

I have problems with dependencies in loops, since I get the following messages when compiling with "pgcc -g -ta=nvidia,cc11 -Minfo -fastsse -c ./main.c -o main.o":
Code:
calc:
     11, No parallel kernels found, accelerator region ignored
     15, Complex loop carried dependence of 'fArr2' prevents parallelization
     17, Complex loop carried dependence of 'fArr2' prevents parallelization
         Generated 4 alternate loops for the loop
         Generated vector sse code for the loop
main:
     38, Loop unrolled 4 times (completely unrolled)
     39, Loop unrolled 4 times (completely unrolled)


Everywhere it is said that I should use either the restrict keyword or the option -Msafeptr, but neither of these works in my case.

I reduced my program to a small example code where I copy data from one array to another (see below). I know that what I am doing there does not really make sense, but first I want to get rid of these dependencies.
Does anyone have an idea?
Code:
#include <stdio.h>
#include <stdlib.h>

void calc(float *restrict fArr1, float *restrict fArr2, int iCols, int iRows)
{
  int i,j;
  int n = iCols * iRows;
  float fVal;
#pragma acc region copy(fArr1[0:n-1], fArr2[0:n-1])
       {
#pragma acc for private(fVal,i,j)
         /* compute stencil, residual and update */
         for (j = 0; j < iRows; j++)
      {
        for (i = 0; i < iCols; i++)
          {
            fVal = 5.0f * fArr1[j*iCols+i];
            fArr2[j*iCols+i] = 2.0f * fVal;
          }
      }
       }
}

int main (int argc, char** argv)
{
    int retVal = 0;    /* return value */
    int i,j;
    int iCols = 4;
    int iRows = 4;

    /*Init arrays*/
    float *fArr1 = (float*) malloc(iCols * iRows * sizeof(float));
    float *fArr2 = (float*) malloc(iCols * iRows * sizeof(float));
 
    if (fArr1 && fArr2)
    {
      /* initialize only after both allocations succeeded */
      for (j=0; j< iRows; ++j){
        for(i=0; i< iCols; ++i){
          fArr1[j*iCols+i] = i;
          fArr2[j*iCols+i] = 0.0f;
        }
      }

      /* running calculations */
      calc(fArr1,fArr2,iCols,iRows);

      /* print one example result */
      printf("Result[%d]: %f\n", iRows*iCols-1,fArr2[iRows*iCols-1]);
    }
    else
    {
        printf(" Memory allocation failed ...\n");
        retVal = -1;
    }

    /* cleanup */
    free(fArr1);
    free(fArr2);
   
    return retVal;
}
mkcolg



Joined: 30 Jun 2004
Posts: 6136
Location: The Portland Group Inc.

Posted: Fri Feb 05, 2010 10:13 am

Hi Xray,

For data parallelism, the operations on each element of your array must be independent. While it's obvious to us in this case that the computed index resolves to a unique index, this is not easily determined by the compiler. We are working on support to handle these simpler cases, but in the meantime you'll need to use one of three options.

1) Force parallelization

You can force parallelization by using the "#pragma acc for parallel" directive before the "j" loop. Note that I would not recommend users do this unless they are sure all computed indices are unique.

Code:
void calc(float *restrict fArr1, float *restrict fArr2, int iCols, int iRows)
{
  int i,j;
  int n = iCols * iRows;
  float fVal;
#pragma acc region copy(fArr1[0:n-1], fArr2[0:n-1])
       {
#pragma acc for parallel
         /* compute stencil, residual and update */
         for (j = 0; j < iRows; j++)
      {
        for (i = 0; i < iCols; i++)
          {
            fVal = 5.0f * fArr1[j*iCols+i];
            fArr2[j*iCols+i] = 2.0f * fVal;
          }
      }
       }
}



Side note, using the private clause is not necessary since scalars are privatized by default.
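
One way to gain that confidence before forcing parallelization is a host-side check that the index expression never yields the same value twice. The following is only a minimal sketch, assuming the row-major formula j*iCols+i from the example; the helper name indices_are_unique is hypothetical and is not part of any PGI API.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: returns 1 if j*iCols+i maps every (j,i) pair
   to a distinct index in [0, iCols*iRows), 0 otherwise (also 0 if the
   scratch buffer cannot be allocated). */
int indices_are_unique(int iCols, int iRows)
{
  int i, j, n, ok = 1;
  unsigned char *seen;

  assert(iCols > 0 && iRows > 0);
  n = iCols * iRows;
  seen = (unsigned char*) calloc(n, 1);   /* one "visited" flag per index */
  if (!seen) return 0;

  for (j = 0; j < iRows && ok; j++)
  {
    for (i = 0; i < iCols; i++)
    {
      int idx = j*iCols+i;
      /* out of range or already written by another iteration */
      if (idx < 0 || idx >= n || seen[idx]) { ok = 0; break; }
      seen[idx] = 1;
    }
  }

  free(seen);
  return ok;
}
```

For the 4x4 case in the sample program, indices_are_unique(4, 4) returns 1, so each iteration writes its own element of fArr2.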


2) Remove the "i" loop

While I don't know your full source, in this example you can remove the "i" loop altogether.

Code:
void calc(float *restrict fArr1, float *restrict fArr2, int iCols, int iRows)
{
  int j;
  int n = iCols * iRows;
  float fVal;
#pragma acc region copy(fArr1[0:n-1], fArr2[0:n-1])
       {
         /* compute stencil, residual and update */
         for (j = 0; j < n; j++)
      {
            fVal = 5.0f * fArr1[j];
            fArr2[j] = 2.0f * fVal;
      }
       }
}
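
The flattened loop is equivalent because j*iCols+i visits every index from 0 to n-1 exactly once. A minimal host-only sketch that compares the two forms is shown here; the accelerator directives are left out, and the names calc_nested, calc_flat and results_match are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Nested form from the original post (directives omitted). */
void calc_nested(const float *fArr1, float *fArr2, int iCols, int iRows)
{
  int i, j;
  for (j = 0; j < iRows; j++)
    for (i = 0; i < iCols; i++)
      fArr2[j*iCols+i] = 2.0f * (5.0f * fArr1[j*iCols+i]);
}

/* Flattened form from option 2 (directives omitted). */
void calc_flat(const float *fArr1, float *fArr2, int n)
{
  int j;
  for (j = 0; j < n; j++)
    fArr2[j] = 2.0f * (5.0f * fArr1[j]);
}

/* Returns 1 if both forms produce identical output for the same input,
   0 otherwise (also 0 if an allocation fails). */
int results_match(int iCols, int iRows)
{
  int k, n, ok = 1;
  float *in, *a, *b;

  assert(iCols > 0 && iRows > 0);
  n = iCols * iRows;
  in = (float*) malloc(n * sizeof(float));
  a  = (float*) malloc(n * sizeof(float));
  b  = (float*) malloc(n * sizeof(float));

  if (in && a && b)
  {
    /* same initialization as the sample: each element holds its column */
    for (k = 0; k < n; k++) in[k] = (float)(k % iCols);
    calc_nested(in, a, iCols, iRows);
    calc_flat(in, b, n);
    for (k = 0; k < n; k++)
      if (a[k] != b[k]) ok = 0;
  }
  else
  {
    ok = 0;
  }

  free(in);
  free(a);
  free(b);
  return ok;
}
```

Note that this only holds when the loop body uses the combined index alone, as it does here; a real stencil that needs i and j separately cannot be flattened this simply.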


3) Use multi-dimensional indexing

Instead of computing the index, you could use multi-dimensional indexing. However, this would require you to change the entire program.

Code:
#include <stdio.h>
#include <stdlib.h>

void calc(float **restrict fArr1, float **restrict fArr2, int iCols, int iRows)
{
  int i,j;
  float fVal;
#pragma acc region copyin(fArr1[0:iRows-1][0:iCols-1]), copyout(fArr2[0:iRows-1][0:iCols-1])
       {
#pragma acc for parallel
         /* compute stencil, residual and update */
         for (j = 0; j < iRows; j++)
      {
        for (i = 0; i < iCols; i++)
          {
            fVal = 5.0f * fArr1[j][i];
            fArr2[j][i] = 2.0f * fVal;
          }
      }
       }
}

int main (int argc, char** argv)
{
    int retVal = 0;    /* return value */
    int i,j;
    int iCols = 4;
    int iRows = 4;

    /*Init arrays*/
    float **fArr1 = (float**) malloc(iRows * sizeof(float*));
    float **fArr2 = (float**) malloc(iRows * sizeof(float*));
    if (fArr1 && fArr2)
    {

    for (j=0; j< iRows; ++j){
      fArr1[j] = (float*) malloc(iCols * sizeof(float));
      fArr2[j] = (float*) malloc(iCols * sizeof(float));
      for(i=0; i< iCols; ++i){
        fArr1[j][i] = i;
        fArr2[j][i] = 0.0f;
      }
    }
        /* running calculations */
      calc(fArr1,fArr2,iCols,iRows);

      /* print one example result */
      printf("Result[%d][%d]: %f\n", iRows-1,iCols-1,fArr2[iRows-1][iCols-1]);
    }
    else
    {
        printf(" Memory allocation failed ...\n");
        retVal = -1;
    }

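    /* cleanup: free each row before the row-pointer arrays */
    if (fArr1 && fArr2)
    {
      for (j=0; j< iRows; ++j){
        free(fArr1[j]);
        free(fArr2[j]);
      }
    }
    free(fArr1);
    free(fArr2);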

    return retVal;
}


Side note: I changed fArr1 to use the "copyin" clause and fArr2 to use the "copyout" clause, since fArr1 is only read and fArr2 is only written inside the region.

Hope this helps,
Mat
xray



Joined: 21 Jan 2010
Posts: 85

Posted: Tue Feb 09, 2010 6:49 am    Post subject: reduction

Thanks for your answers.
I tried all of the suggested solutions with my sample code and they all worked. But what I've noticed is that if I force parallelization, I still get the compiler message that loop dependencies prevent parallelization, and just after that I get: Accelerator kernel generated. Do you know why?

And one more question:
In my bigger code I have a reduction, and with that it's not working. I also added the reduction to my sample code (within the i-loop) and also get an internal compiler error: "pgnvd job exited with nonzero status code 0".
Is it possible that the reduction cannot be detected when forcing parallelism? It must have something to do with the nested loop, since usually a reduction works, doesn't it?
mkcolg



Joined: 30 Jun 2004
Posts: 6136
Location: The Portland Group Inc.

Posted: Tue Feb 09, 2010 9:51 am

Hi Xray,

Quote:
I still get the compiler message that loop dependencies prevent parallelization, and just after that I get: Accelerator kernel generated. Do you know why?
The compiler is still parallelizing the code but just letting you know that it thinks you shouldn't be doing it.

Quote:
In my bigger code I have a reduction, and with that it's not working. I also added the reduction to my sample code (within the i-loop) and also get an internal compiler error: "pgnvd job exited with nonzero status code 0".
Is it possible that the reduction cannot be detected when forcing parallelism? It must have something to do with the nested loop, since usually a reduction works, doesn't it?

Support for reductions is brand new, so it may have problems. Please write the example code the way you think it should work and then send a report to PGI Customer Support (trs@pgroup.com). We'll have one of our engineers (probably me) take a look and let you know what's going on.

Thanks,
Mat
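
For reference, a sum reduction inside the "i" loop could look like the minimal sketch below. This is only an assumption about the kind of code being discussed, not xray's actual source; the function name calc_sum and the variable fSum are hypothetical, and the accelerator directives are left out so the snippet builds with any C compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: same loop nest as the sample program, plus a
   sum reduction over all values written to fArr2. */
float calc_sum(const float *fArr1, float *fArr2, int iCols, int iRows)
{
  int i, j;
  float fVal;
  float fSum = 0.0f;   /* reduction variable shared by all iterations */

  assert(fArr1 != NULL && fArr2 != NULL);

  for (j = 0; j < iRows; j++)
  {
    for (i = 0; i < iCols; i++)
    {
      fVal = 5.0f * fArr1[j*iCols+i];
      fArr2[j*iCols+i] = 2.0f * fVal;
      fSum += fArr2[j*iCols+i];   /* every iteration updates fSum */
    }
  }
  return fSum;
}
```

Unlike the writes to fArr2, every iteration updates the same scalar fSum, so the iterations are not independent; the loop can only run in parallel if the compiler treats that update as a reduction.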
All times are GMT - 7 Hours