xray
Joined: 21 Jan 2010 Posts: 71
|
Posted: Fri Feb 05, 2010 1:08 am Post subject: dependence in loop prevents parallelization |
|
|
Hi,
I'm using PGI 10.1 on 64-bit Linux.
I have problems with dependencies in loops, since I get the following messages when compiling with "pgcc -g -ta=nvidia,cc11 -Minfo -fastsse -c ./main.c -o main.o":
| Code: | calc:
11, No parallel kernels found, accelerator region ignored
15, Complex loop carried dependence of 'fArr2' prevents parallelization
17, Complex loop carried dependence of 'fArr2' prevents parallelization
Generated 4 alternate loops for the loop
Generated vector sse code for the loop
main:
38, Loop unrolled 4 times (completely unrolled)
39, Loop unrolled 4 times (completely unrolled) |
Everywhere I read that I should use either the restrict keyword or the option -Msafeptr, but neither works in my case.
I reduced my program to a small example where I copy data from one array to another (see below). I know that what I am doing there does not really make sense, but first I want to get rid of these dependencies.
Does anyone have an idea?
| Code: | #include <stdio.h>
#include <stdlib.h>
void calc(float *restrict fArr1, float *restrict fArr2, int iCols, int iRows)
{
int i,j;
int n = iCols * iRows;
float fVal;
#pragma acc region copy(fArr1[0:n-1], fArr2[0:n-1])
{
#pragma acc for private(fVal,i,j)
/* compute stencil, residual and update */
for (j = 0; j < iRows; j++)
{
for (i = 0; i < iCols; i++)
{
fVal = 5.0f * fArr1[j*iCols+i];
fArr2[j*iCols+i] = 2.0f * fVal;
}
}
}
}
int main (int argc, char** argv)
{
int retVal = 0; /* return value */
int i,j;
int iCols = 4;
int iRows = 4;
/*Init arrays*/
float *fArr1 = (float*) malloc(iCols * iRows * sizeof(float));
float *fArr2 = (float*) malloc(iCols * iRows * sizeof(float));
for (j=0; j< iRows; ++j){
for(i=0; i< iCols; ++i){
fArr1[j*iCols+i] = i;
fArr2[j*iCols+i] = 0.0f;
}
}
if (fArr1 && fArr2)
{
/* running calculations */
calc(fArr1,fArr2,iCols,iRows);
/* print one example result */
printf("Result[%d]: %f\n", iRows*iCols-1,fArr2[iRows*iCols-1]);
}
else
{
printf(" Memory allocation failed ...\n");
retVal = -1;
}
/* cleanup */
free(fArr1);
free(fArr2);
return retVal;
} |
|
|
mkcolg
Joined: 30 Jun 2004 Posts: 4996 Location: The Portland Group Inc.
|
Posted: Fri Feb 05, 2010 10:13 am Post subject: |
|
|
Hi Xray,
For data parallelism, the operations on each element of your array must be independent. While it's obvious to us in this case that the computed index resolves to a unique index for every iteration, the compiler cannot easily determine this. We are working on support for these simpler cases, but in the meantime you'll need to use one of three options.
1) Force parallelization
You can force parallelization by using the "#pragma acc for parallel" directive before the "j" loop. Note that I would not recommend users do this unless they are sure all computed indices are unique.
| Code: | void calc(float *restrict fArr1, float *restrict fArr2, int iCols, int iRows)
{
int i,j,idx;
int n = iCols * iRows;
float fVal;
#pragma acc region copy(fArr1[0:n-1], fArr2[0:n-1])
{
#pragma acc for parallel
/* compute stencil, residual and update */
for (j = 0; j < iRows; j++)
{
for (i = 0; i < iCols; i++)
{
fVal = 5.0f * fArr1[j*iCols+i];
fArr2[j*iCols+i] = 2.0f * fVal;
}
}
}
}
|
Side note, using the private clause is not necessary since scalars are privatized by default.
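If you are unsure whether the computed indices are unique, one way to check on the host before forcing parallelization is to walk the same loop nest and record every index that gets written. The sketch below is only an illustration: `indices_are_unique`, `row_major_index`, and `colliding_index` are hypothetical names, not part of the PGI toolchain, and the check only covers the sizes you actually run it with.

```c
#include <stdlib.h>

/* Hypothetical index functions: the first is the expression used in calc(),
   the second is a deliberately wrong one that maps several (j, i) pairs to
   the same element. */
int row_major_index(int j, int i, int iCols) { return j * iCols + i; }
int colliding_index(int j, int i, int iCols) { (void) iCols; return j + i; }

/* Hypothetical helper. Returns 1 if every (j, i) pair maps to a distinct
   index inside [0, iRows*iCols), 0 if an index repeats or is out of range,
   and -1 on bad sizes or allocation failure. */
int indices_are_unique(int (*index_fn)(int, int, int), int iCols, int iRows)
{
    int i, j, idx, n, ok = 1;
    unsigned char *seen;

    if (iCols <= 0 || iRows <= 0)
        return -1;
    n = iCols * iRows;
    seen = (unsigned char *) calloc((size_t) n, 1);
    if (!seen)
        return -1;

    for (j = 0; j < iRows && ok; j++) {
        for (i = 0; i < iCols; i++) {
            idx = index_fn(j, i, iCols);
            if (idx < 0 || idx >= n || seen[idx]) {
                ok = 0;          /* two iterations would touch the same element */
                break;
            }
            seen[idx] = 1;
        }
    }
    free(seen);
    return ok;
}
```

Calling `indices_are_unique(row_major_index, iCols, iRows)` with the sizes from the example reports that the writes are independent, while `colliding_index` is flagged because, for instance, (j=0, i=1) and (j=1, i=0) land on the same element.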
2) Remove the "i" loop
While I don't know your full source, in this example you can remove the "i" loop altogether.
| Code: | void calc(float *restrict fArr1, float *restrict fArr2, int iCols, int iRows)
{
int i,j;
int n = iCols * iRows;
float fVal;
#pragma acc region copy(fArr1[0:n-1], fArr2[0:n-1])
{
/* compute stencil, residual and update */
for (j = 0; j < n; j++)
{
fVal = 5.0f * fArr1[j];
fArr2[j] = 2.0f * fVal;
}
}
}
|
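If your real loop body needs the row and column numbers (the comment mentions a stencil), they can be recovered from the flat index with a division and a modulo. The following is a host-only sketch under that assumption; `calc_flat_interior` is a hypothetical name, the boundary handling is invented purely for illustration, and the thread does not confirm how the compiler treats this form inside an accelerator region.

```c
/* Hypothetical host-only variant of the single-loop form. The flat index k
   corresponds to k = j*iCols + i, so j and i can be recovered from it. */
void calc_flat_interior(const float *restrict fArr1, float *restrict fArr2,
                        int iCols, int iRows)
{
    int k;
    int n = iCols * iRows;

    for (k = 0; k < n; k++) {
        int j = k / iCols;   /* row recovered from the flat index */
        int i = k % iCols;   /* column recovered from the flat index */
        int interior = (j > 0 && j < iRows - 1 && i > 0 && i < iCols - 1);

        /* Interior points get the update from the example above;
           boundary points are simply copied. */
        fArr2[k] = interior ? 2.0f * (5.0f * fArr1[k]) : fArr1[k];
    }
}
```

Each iteration still writes exactly one element, `fArr2[k]`, so the independence argument from option 2 carries over unchanged.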
3) Use multi-dimensional indexing
Instead of computing the index, you could use multi-dimensional indexing. Though, this would require you to change the entire program.
| Code: | #include <stdio.h>
#include <stdlib.h>
void calc(float **restrict fArr1, float **restrict fArr2, int iCols, int iRows)
{
int i,j,idx;
int n = iCols * iRows;
float fVal;
#pragma acc region copyin(fArr1[0:iRows-1][0:iCols-1]), copyout(fArr2[0:iRows-1][0:iCols-1])
{
#pragma acc for parallel
/* compute stencil, residual and update */
for (j = 0; j < iRows; j++)
{
for (i = 0; i < iCols; i++)
{
fVal = 5.0f * fArr1[j][i];
fArr2[j][i] = 2.0f * fVal;
}
}
}
}
int main (int argc, char** argv)
{
int retVal = 0; /* return value */
int i,j;
int iCols = 4;
int iRows = 4;
/*Init arrays*/
float **fArr1 = (float**) malloc(iRows * sizeof(float*));
float **fArr2 = (float**) malloc(iRows * sizeof(float*));
if (fArr1 && fArr2)
{
for (j=0; j< iRows; ++j){
fArr1[j] = (float*) malloc(iCols * sizeof(float));
fArr2[j] = (float*) malloc(iCols * sizeof(float));
for(i=0; i< iCols; ++i){
fArr1[j][i] = i;
fArr2[j][i] = 0.0f;
}
}
/* running calculations */
calc(fArr1,fArr2,iCols,iRows);
/* print one example result */
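printf("Result[%d][%d]: %f\n", iRows-1,iCols-1,fArr2[iRows-1][iCols-1]);
/* free the individual rows before the row-pointer arrays are freed below */
for (j=0; j< iRows; ++j){
free(fArr1[j]);
free(fArr2[j]);
}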
}
else
{
printf(" Memory allocation failed ...\n");
retVal = -1;
}
/* cleanup */
free(fArr1);
free(fArr2);
return retVal;
}
|
Side note: I changed fArr1 to use the "copyin" clause and fArr2 to use the "copyout" clause, since fArr1 is only read and fArr2 is only written inside the region.
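Since option 3 allocates every row separately, each row also has to be checked and freed separately. A minimal sketch of how that bookkeeping could be wrapped up, assuming two hypothetical helpers `alloc_rows` and `free_rows` that do not appear in the code above:

```c
#include <stdlib.h>

/* Hypothetical helper: frees rows 0..iRows-1 and then the row-pointer array.
   Passing NULL is allowed and does nothing. */
void free_rows(float **fArr, int iRows)
{
    int j;
    if (!fArr)
        return;
    for (j = 0; j < iRows; j++)
        free(fArr[j]);
    free(fArr);
}

/* Hypothetical helper: allocates iRows rows of iCols floats each. The
   contents are uninitialised; the caller fills them as in main(). Returns
   NULL on bad sizes or if any allocation fails, after releasing the rows
   that were already allocated. */
float **alloc_rows(int iRows, int iCols)
{
    int j;
    float **fArr;

    if (iRows <= 0 || iCols <= 0)
        return NULL;
    fArr = (float **) malloc((size_t) iRows * sizeof(float *));
    if (!fArr)
        return NULL;
    for (j = 0; j < iRows; j++) {
        fArr[j] = (float *) malloc((size_t) iCols * sizeof(float));
        if (!fArr[j]) {
            free_rows(fArr, j);   /* only rows 0..j-1 exist at this point */
            return NULL;
        }
    }
    return fArr;
}
```

With these, main() could obtain both arrays with `alloc_rows(iRows, iCols)`, test the two results once, and release them with `free_rows(fArr1, iRows)` and `free_rows(fArr2, iRows)`.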
Hope this helps,
Mat |
|
xray
Joined: 21 Jan 2010 Posts: 71
|
Posted: Tue Feb 09, 2010 6:49 am Post subject: reduction |
|
|
Thanks for your answers.
I tried all the suggested solutions with my sample code and they all worked. But what I've noticed is that if I force parallelisation, I still get the compiler message that loop dependencies prevent parallelisation, and just after that I get: Accelerator kernel generated. Do you know why?
And one more question:
In my bigger code I have a reduction, and with that it's not working. I also added the reduction to my sample code (within the i-loop) and I also get an internal compiler error: "pgnvd job exited with nonzero status code 0".
Is it possible that the reduction cannot be detected when forcing parallelism? It must have something to do with the nested loop, since usually a reduction works, doesn't it? |
|
mkcolg
Joined: 30 Jun 2004 Posts: 4996 Location: The Portland Group Inc.
|
Posted: Tue Feb 09, 2010 9:51 am Post subject: |
|
|
Hi Xray,
| Quote: | I still get the compiler message that loop dependencies prevent parallelisation and just after that I get: Accelerator kernel generated. Do you know why?
| The compiler still parallelizes the code; the message just lets you know that it thinks you shouldn't be doing it.
| Quote: | In my bigger code I have a reduction, and with that it's not working. I also added the reduction to my sample code (within the i-loop) and I also get an internal compiler error: "pgnvd job exited with nonzero status code 0".
Is it possible that the reduction cannot be detected when forcing parallelism? It must have something to do with the nested loop, since usually a reduction works, doesn't it? |
Support for reductions is brand new, so it may have problems. Please get the example code into the form you think should work and then send a report to PGI Customer Support (trs@pgroup.com). We'll have one of our engineers (probably me) take a look and let you know what's going on.
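As a starting point for such a report, a reproducer might look roughly like the sketch below. This is an assumption about what the failing code looks like, not the poster's actual source: `calc_sum` and `fSum` are hypothetical names, the pragmas reuse the syntax shown earlier in this thread, and whether the compiler recognises `fSum` as a reduction when parallelization is forced is exactly the open question.

```c
/* Hypothetical reproducer: calc() from this thread plus a sum over fArr2
   accumulated inside the i loop. A compiler without PGI Accelerator support
   typically ignores the pragmas and runs the loops on the host, which gives
   a reference result to compare against. */
float calc_sum(float *restrict fArr1, float *restrict fArr2, int iCols, int iRows)
{
    int i, j;
    int n = iCols * iRows;   /* only referenced by the copy clause below */
    float fVal;
    float fSum = 0.0f;
#pragma acc region copy(fArr1[0:n-1], fArr2[0:n-1])
    {
#pragma acc for parallel
        for (j = 0; j < iRows; j++)
        {
            for (i = 0; i < iCols; i++)
            {
                fVal = 5.0f * fArr1[j*iCols+i];
                fArr2[j*iCols+i] = 2.0f * fVal;
                fSum += fArr2[j*iCols+i];   /* the reduction in question */
            }
        }
    }
    return fSum;
}
```

Including the host-only result for the 4x4 input from the original example in the report makes it easy to see whether the accelerator version computes the same sum once it compiles.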
Thanks,
Mat |
|