xray
Joined: 21 Jan 2010 Posts: 71
Posted: Wed Feb 20, 2013 7:27 am Post subject: Computing multiple elements per thread in OpenACC
Hi,
assume we have the following code:
| Code: |
#pragma acc kernels
#pragma acc loop gang(16) vector(32)
for (int i = 0; i < 2048; i++) {
    // do something with array[i]
}
|
With PGI Compiler 12.9, this created a grid of 16 blocks with 32 threads each (512 threads in total), so each CUDA thread processed 4 of the 2048 elements.
With PGI Compiler 13.1, this is no longer possible. If I specify both the gang and the vector size, the gang size is ignored at run time, even though the compiler feedback reports that it uses 16 gangs. Instead, 13.1 automatically executes the loop with a grid size of 64 and a vector size of 32, i.e. one element per thread.
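The arithmetic behind the two launch configurations can be checked with a small host-side sketch. It assumes the iterations are dealt out cyclically over all gangs * vector threads (a grid-stride pattern); the thread does not state which distribution the PGI compiler actually uses, and both function names are hypothetical helpers introduced only for illustration.

```c
#include <stddef.h>

/* Hypothetical helper: counts how many of the n loop iterations one thread
 * handles, ASSUMING a cyclic (grid-stride) distribution over
 * gangs * vector threads. The real compiler mapping may differ. */
size_t iterations_for_thread(size_t n, size_t gangs, size_t vector,
                             size_t gang_id, size_t lane_id)
{
    size_t total = gangs * vector;            /* threads in the whole grid */
    size_t tid = gang_id * vector + lane_id;  /* global thread index */
    size_t count = 0;

    for (size_t i = tid; i < n; i += total) {
        count++;                              /* this thread owns iteration i */
    }
    return count;
}

/* Hypothetical helper: grid size needed when every thread handles exactly
 * one iteration, i.e. ceil(n / vector). */
size_t gangs_for_one_element_each(size_t n, size_t vector)
{
    return (n + vector - 1) / vector;
}
```

Under these assumptions, `iterations_for_thread(2048, 16, 32, g, l)` gives 4 for every thread, matching the 12.9 behavior described above, while `gangs_for_one_element_each(2048, 32)` gives the grid size of 64 observed with 13.1.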
Is this a bug or intended behavior? If the latter, why?
Kind regards, Sandra
 |
mkcolg
Joined: 30 Jun 2004 Posts: 4995 Location: The Portland Group Inc.
Posted: Wed Feb 20, 2013 4:34 pm
Hi Sandra,
No, this doesn't look correct. I've opened up a problem report (TPR#19149) and sent it to our engineers for further investigation.
Thanks!
Mat
 |
xray
Joined: 21 Jan 2010 Posts: 71
Posted: Thu Feb 21, 2013 2:49 am
Thanks.
Just one addition: if I use a gang schedule for the outer loop of a loop nest and a vector schedule for the inner loop, and specify both sizes, the specified number of gangs is also ignored:
| Code: |
#pragma acc parallel vector_length(64) num_gangs(128)
#pragma acc loop gang
for (int j = 0; j < n; j++)
{
    #pragma acc loop vector
    for (int i = 0; i < m; i++) {..}
}
|
The ACC_NOTIFY output shows block=64, as requested, but grid=8190 (which is n in my case) instead of the 128 gangs I specified.
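For comparison, here is a rough host-side sketch of what honoring num_gangs(128) would imply for the outer loop. It assumes the j iterations are distributed cyclically over the gangs, which the thread does not confirm, and `outer_iterations_for_gang` is a hypothetical helper name used only for this illustration.

```c
#include <stddef.h>

/* Hypothetical helper: number of outer-loop (j) iterations one gang would
 * execute if n iterations were shared among num_gangs gangs.
 * ASSUMPTION: cyclic distribution (gang g takes j = g, g + num_gangs, ...);
 * the distribution actually used by the compiler is not stated. */
size_t outer_iterations_for_gang(size_t n, size_t num_gangs, size_t gang_id)
{
    size_t count = 0;

    for (size_t j = gang_id; j < n; j += num_gangs) {
        count++;                 /* this gang owns outer iteration j */
    }
    return count;
}
```

With n = 8190 and 128 gangs, each gang would execute 63 or 64 outer iterations under this assumption. A grid of 8190 instead corresponds to `outer_iterations_for_gang(8190, 8190, g)`, i.e. one outer iteration per gang.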
Sandra
 |
jtull
Joined: 30 Jun 2004 Posts: 233
Posted: Fri May 17, 2013 4:48 pm
Sandra,
TPR 19149 has been fixed in the current 13.5 release.
dave