PGI User Forum


How to parallelize this loop...
PGI User Forum Forum Index -> Accelerator Programming
PaulPa

Joined: 02 Aug 2012
Posts: 35

Posted: Thu Sep 20, 2012 8:08 am    Post subject: How to parallelize this loop...

Hello,

Can somebody please tell me how to parallelize the inner loop using the kernels construct? The compiler reports that the inner loop is parallelizable, yet it refuses to parallelize it.

Code:

inline void matvec(const struct MatrixCRS* A, restrict const float* g, restrict float* y){
    int i,j;
    int n = A->n;
    int nnz = A->nnz;
    restrict int *ptr = A->ptr;
    restrict int *index = A->index;
    restrict float *value = A->value;

#pragma acc kernels present(ptr[0:n+1],index[0:nnz],value[0:nnz], g[0:n], y[0:n])
    {
#pragma acc loop independent
        for(i=0; i<n; i++){
            float tmp = 0.0;
#pragma acc loop independent reduction(+:tmp)
            for(j=ptr[i]; j<ptr[i+1]; j++){
                tmp+=value[j]*g[index[j]];
            }
            y[i]=tmp;
        }
    }
}


The output is:
Code:

matvec:                                                                       
     54, Generating present(y[0:n])
         Generating present(g[0:n])
         Generating present(value[0:nnz])
         Generating present(index[0:nnz])
         Generating present(ptr[0:n+1])
         Generating compute capability 2.0 binary
     57, Loop is parallelizable
         Accelerator kernel generated
         57, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
             Cached references to size [(x+1)] block of 'ptr'
             CC 2.0 : 23 registers; 0 shared, 100 constant, 0 local memory bytes
     60, Loop is parallelizable


If I read the compiler feedback correctly, this parallel version parallelizes the inner loop as well:
Code:

inline void matvec(const struct MatrixCRS* A, restrict const floatType* g, restrict floatType* y){
    int i,j;
    int n = A->n;
    int nnz = A->nnz;
    restrict int *ptr = A->ptr;
    restrict int *index = A->index;
    restrict floatType *value = A->value;

#pragma acc parallel present(ptr[0:n+1],index[0:nnz],value[0:nnz], g[0:n], y[0:n]) vector_length(32)
    {
#pragma acc loop gang
        for(i=0; i<n; i++){
            floatType tmp = 0.0;
#pragma acc loop vector reduction(+:tmp)
            for(j=ptr[i]; j<ptr[i+1]; j++){
                tmp+=value[j]*g[index[j]];
            }
            y[i]=tmp;
        }
    }
}


This is the respective compiler feedback:
Code:

matvec:
     54, Accelerator kernel generated
         54, CC 2.0 : 20 registers; 32 shared, 92 constant, 0 local memory bytes
         57, #pragma acc loop gang /* blockIdx.x */
         60, #pragma acc loop vector(32) /* threadIdx.x */
     54, Generating present(y[0:n])
         Generating present(g[0:n])
         Generating present(value[0:nnz])
         Generating present(index[0:nnz])
         Generating present(ptr[0:n+1])
         Generating compute capability 2.0 binary
     60, Loop is parallelizable


Best,
Paul
mkcolg

Joined: 30 Jun 2004
Posts: 6206
Location: The Portland Group Inc.

Posted: Thu Sep 20, 2012 8:54 am

Hi Paul,

Since the compiler has scheduled the outer loop as "gang, vector(128)", it ignores the inner loop directive. Scheduling the outer loop as the vector makes it impossible to perform the parallel sum reduction in the inner loop. You can override the compiler's choice by explicitly setting the inner loop schedule to "vector".

Try both schedules and see which one is faster for your code. I've seen cases where each wins, but more often find parallelizing just the outer loop faster.

Hope this helps,
Mat

Code:
% cat test.c
struct MatrixCRS {
    int n;
    int nnz;
    int *ptr;
    int *index;
    float *value;
};


void matvec(const struct MatrixCRS* A, restrict const float* g, restrict float* y){
    int i,j;
    int n = A->n;
    int nnz = A->nnz;
    restrict int *ptr = A->ptr;
    restrict int *index = A->index;
    restrict float *value = A->value;

#pragma acc kernels present(ptr[0:n+1],index[0:nnz],value[0:nnz], g[0:n], y[0:n])
    {
#pragma acc loop independent
        for(i=0; i<n; i++){
            float tmp = 0.0;
#pragma acc loop independent vector reduction(+:tmp)
            for(j=ptr[i]; j<ptr[i+1]; j++){
                tmp+=value[j]*g[index[j]];
            }
            y[i]=tmp;
        }
    }
}
% pgcc -c test.c -Minfo=accel -Msafeptr -acc
matvec:
     18, Generating present(y[0:n])
         Generating present(g[0:n])
         Generating present(value[0:nnz])
         Generating present(index[0:nnz])
         Generating present(ptr[0:n+1])
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     21, Loop is parallelizable
         Accelerator kernel generated
         21, #pragma acc loop gang /* blockIdx.x */
             CC 1.0 : 16 registers; 96 shared, 4 constant, 0 local memory bytes
             CC 2.0 : 16 registers; 32 shared, 92 constant, 0 local memory bytes
         24, #pragma acc loop vector(128) /* threadIdx.x */
         Loop is parallelizable
PaulPa

Joined: 02 Aug 2012
Posts: 35

Posted: Thu Sep 20, 2012 9:29 am

Hi Mat,

Thanks for your quick help. Unfortunately, the compiler still schedules the outer loop across vectors. However, it now seems to parallelize the inner loop as well.

Code:

inline void matvec(const struct MatrixCRS* A, restrict const floatType* g, restrict floatType* y){
    int i,j;
    int n = A->n;
    int nnz = A->nnz;
    restrict int *ptr = A->ptr;
    restrict int *index = A->index;
    restrict floatType *value = A->value;

#pragma acc kernels present(ptr[0:n+1],index[0:nnz],value[0:nnz], g[0:n], y[0:n])
    {
#pragma acc loop independent
        for(i=0; i<n; i++){
            floatType tmp = 0.0;
#pragma acc loop independent vector reduction(+:tmp)
//#pragma unroll(32)
            for(j=ptr[i]; j<ptr[i+1]; j++){
                tmp+=value[j]*g[index[j]];
            }
            y[i]=tmp;
        }
    }
}


The compiler feedback is:
Code:

     53, Generating present(y[0:n])
         Generating present(g[0:n])
         Generating present(value[0:nnz])
         Generating present(index[0:nnz])
         Generating present(ptr[0:n+1])
         Generating compute capability 2.0 binary
     56, Loop is parallelizable
         Accelerator kernel generated
         56, #pragma acc loop gang, vector(64) /* blockIdx.x threadIdx.x */
             Cached references to size [(x+1)] block of 'ptr'
             CC 2.0 : 22 registers; 16 shared, 92 constant, 0 local memory bytes
         60, #pragma acc loop vector(32) /* threadIdx.y */
         Loop is parallelizable




This is still true if I add "gang" to the outer loop.

I'm using pgi 12.8 with the following options:
-Minfo=accel -acc -ta=nvidia,4.1,cc20

Best,
Paul
mkcolg

Joined: 30 Jun 2004
Posts: 6206
Location: The Portland Group Inc.

Posted: Fri Sep 21, 2012 1:04 pm

Hi Paul,

Quote:
I'm using pgi 12.8 with the following options

I was using 12.9, hence the difference.

What's the performance delta between the various schedules? Are there any correctness issues?

- Mat
PaulPa

Joined: 02 Aug 2012
Posts: 35

Posted: Mon Sep 24, 2012 9:29 am

Hi Mat,

The version which parallelizes the inner loop runs for 4.5 seconds, while the version which does not parallelize the inner loop takes 6.15 seconds. If I change the vector_length from 256 (chosen by the compiler) to 32, the runtime drops to 2.67 seconds. So it makes quite a difference.

Moreover, the last version I posted does not return correct results, even though the compiler does not issue any warnings.

Best,
Paul