PGI User Forum

Reduction prevents parallel execution on two GPUs
xray
Posted: Thu Mar 06, 2014 1:23 am
Subject: Reduction prevents parallel execution on two GPUs

Hi,
Once again I am using a very simple Jacobi solver, this time solving the equation on two GPUs. I use acc_set_device_num(0,acc_device_nvidia) and acc_set_device_num(1,acc_device_nvidia) to select what executes on device 0 and device 1; I do not use multiple MPI or OpenMP processes/threads for this. The rows of the matrix are distributed between the two devices.
However, the Jacobi computation needs a reduction (see below), and this reduction prevents the two parts from executing in parallel on the two GPUs. I do not understand why. If I comment out the reduction computation for device 0, I can see in the NVIDIA profiler that both kernels run in parallel, as you would expect.
Is there a workaround for this issue? I tried the 14.1 and 14.2 compilers.
Sandra

Code:
    // j_cpu_start is the number of rows to compute on CPU
    // distribute other rows to two GPUs
    // put data on GPU 0 and 1
    while ( err > tol && iter < iter_max ) {

        acc_err0 = 0.0;
        acc_err1 = 0.0;
        err = 0.0;

        // do one part on GPU
        acc_set_device_num(0,acc_device_nvidia);
        #pragma acc parallel loop present(Anew[0:(j_cpu_start/2+1)*m],A[0:(j_cpu_start/2+1)*m]) async reduction(max:acc_err0)
        for( j = 1; j < j_cpu_start/2; j++) {
            for( i = 1; i < m-1; i++ ) {
                Anew[j *m+ i] = 0.25 * ( A[j     *m+ (i+1)] + A[j     *m+ (i-1)]
                                     +   A[(j-1) *m+ i]     + A[(j+1) *m+ i]);
                acc_err0 = fmax(acc_err0,fabs(Anew[j*m+i]-A[j*m+i]));
            }
        }

        acc_set_device_num(1,acc_device_nvidia);
        #pragma acc parallel loop present(Anew[(j_cpu_start/2-1)*m:(j_cpu_start/2+2)*m],A[(j_cpu_start/2-1)*m:(j_cpu_start/2+2)*m]) async reduction(max:acc_err1)
        for( j = j_cpu_start/2; j < j_cpu_start; j++) {
            for( i = 1; i < m-1; i++ ) {
                Anew[j *m+ i] = 0.25 * ( A[j     *m+ (i+1)] + A[j     *m+ (i-1)]
                                     +   A[(j-1) *m+ i]     + A[(j+1) *m+ i]);
                acc_err1 = fmax(acc_err1,fabs(Anew[j*m+i]-A[j*m+i]));
            }
        }

        // do other part on CPU
        #pragma omp parallel for default(none) shared(A,Anew,m,n,j_cpu_start) private(i,j) reduction(max:err)
        for( j = j_cpu_start; j<n-1; j++) {
            for( i = 1; i < m-1; i++ ) {
                Anew[j *m+ i] = 0.25 * ( A[j     *m+ (i+1)] + A[j     *m+ (i-1)]
                                     +   A[(j-1) *m+ i]     + A[(j+1) *m+ i]);
                err = fmax(err,fabs(Anew[j*m+i]-A[j*m+i]));
            }
        }
        // exchange halo
        // do matrix swap
    }

mkcolg (The Portland Group Inc.)
Posted: Thu Mar 06, 2014 12:22 pm

Hi Sandra,

It's probably not the reduction itself but rather the copying back of the reduction variable to the host that's causing the problem. Try putting acc_err0 and acc_err1 into a data region and then manage when the variables are copied back using the update directive. Something like:
Code:
    // j_cpu_start is the number of rows to compute on CPU
    // distribute other rows to two GPUs
    // put data on GPU 0 and 1
    while ( err > tol && iter < iter_max ) {

        acc_err0 = 0.0;
        acc_err1 = 0.0;
        err = 0.0;
        acc_set_device_num(1,acc_device_nvidia);
        #pragma acc update device(acc_err1)

        // do one part on GPU
        acc_set_device_num(0,acc_device_nvidia);
        #pragma acc update device(acc_err0)
        #pragma acc parallel loop present(Anew[0:(j_cpu_start/2+1)*m],A[0:(j_cpu_start/2+1)*m], acc_err0) async reduction(max:acc_err0)
        for( j = 1; j < j_cpu_start/2; j++) {
            for( i = 1; i < m-1; i++ ) {
                Anew[j *m+ i] = 0.25 * ( A[j     *m+ (i+1)] + A[j     *m+ (i-1)]
                                     +   A[(j-1) *m+ i]     + A[(j+1) *m+ i]);
                acc_err0 = fmax(acc_err0,fabs(Anew[j*m+i]-A[j*m+i]));
            }
        }

        acc_set_device_num(1,acc_device_nvidia);
        #pragma acc parallel loop present(Anew[(j_cpu_start/2-1)*m:(j_cpu_start/2+2)*m],A[(j_cpu_start/2-1)*m:(j_cpu_start/2+2)*m], acc_err1) async reduction(max:acc_err1)
        for( j = j_cpu_start/2; j < j_cpu_start; j++) {
            for( i = 1; i < m-1; i++ ) {
                Anew[j *m+ i] = 0.25 * ( A[j     *m+ (i+1)] + A[j     *m+ (i-1)]
                                     +   A[(j-1) *m+ i]     + A[(j+1) *m+ i]);
                acc_err1 = fmax(acc_err1,fabs(Anew[j*m+i]-A[j*m+i]));
            }
        }

        // do other part on CPU
        #pragma omp parallel for default(none) shared(A,Anew,m,n,j_cpu_start) private(i,j) reduction(max:err)
        for( j = j_cpu_start; j<n-1; j++) {
            for( i = 1; i < m-1; i++ ) {
                Anew[j *m+ i] = 0.25 * ( A[j     *m+ (i+1)] + A[j     *m+ (i-1)]
                                     +   A[(j-1) *m+ i]     + A[(j+1) *m+ i]);
                err = fmax(err,fabs(Anew[j*m+i]-A[j*m+i]));
            }
        }

        #pragma acc update host(acc_err1)
        acc_set_device_num(0,acc_device_nvidia);
        #pragma acc update device(acc_err0)

        // exchange halo
        // do matrix swap
    }


xray
Posted: Fri Mar 07, 2014 1:00 am

Hi Mat,
I believe you have a copy-paste error in your code: I assume the last "update device" should actually be an "update host".
At first I did not notice that acc_err0/1 also appear in the present clauses of the kernels. The consequence was that (1) I got wrong results and (2) the kernels did not run in parallel. Eventually I figured out that they definitely have to be in the present clauses.
With these changes, the code finally runs fast and gives correct results.
Still, I see this more as a workaround (thanks for getting it to work!). Or is this really how it is supposed to work? To me it looks more like a compiler issue, doesn't it?

BTW: I used unstructured data lifetimes (see below) to put the data on both devices. How can you use structured data regions with two devices? Must they be nested? I put a sketch of my guess after the code below.

Code:
    acc_set_device_num(0,acc_device_nvidia);
    #pragma acc enter data create(Anew[0:(j_cpu_start/2+1)*m])
    #pragma acc enter data copyin(A[0:(j_cpu_start/2+1)*m])
    #pragma acc enter data create(acc_err0)

    acc_set_device_num(1,acc_device_nvidia);
    #pragma acc enter data create(Anew[(j_cpu_start/2-1)*m:(j_cpu_start/2+2)*m])
    #pragma acc enter data copyin(A[(j_cpu_start/2-1)*m:(j_cpu_start/2+2)*m])
    #pragma acc enter data create(acc_err1)


Thanks, Mat, for your help.
Sandra

mkcolg (The Portland Group Inc.)
Posted: Fri Mar 07, 2014 10:57 am

Quote:
I assume the last "update device" should actually be an "update host".
Yes, thanks for catching that. I didn't actually test the code since I didn't have the complete example.

Quote:
Still, I see this more as a workaround (thanks for getting it to work!). Or is this really how it is supposed to work? To me it looks more like a compiler issue, doesn't it?
The OpenACC spec states that reduction results are available after the compute region. Since both acc_err0 and acc_err1 were host-only variables, their values needed to be updated immediately after the compute region.

I think what you're saying is that since the kernel has the async clause, the compiler should delay copying back the reduction's value. The question is: when should it be copied back? The compiler can't guess, so it needs a well-defined spot, and the only well-defined spot is immediately after the kernel ends. Ideally there would be some callback from the device, but that doesn't exist, so the host needs to block waiting for the reduction variable to come back.
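To make the tradeoff concrete, here's a minimal sketch (x and n are made-up names, not from your code; assume x[0:n] is already on the device):
Code:
    double err = 0.0;
    int i;

    // Host-only reduction variable: the result must be copied back
    // when the region ends, so the async clause is effectively lost;
    // the host blocks until the kernel finishes.
    #pragma acc parallel loop present(x[0:n]) reduction(max:err) async
    for( i = 0; i < n; i++ )
        err = fmax(err,fabs(x[i]));

    // Device-resident reduction variable: the result can stay on the
    // device, and the host decides when to fetch it.
    #pragma acc enter data create(err)
    #pragma acc update device(err)
    #pragma acc parallel loop present(err,x[0:n]) reduction(max:err) async
    for( i = 0; i < n; i++ )
        err = fmax(err,fabs(x[i]));
    // ... other work can overlap here ...
    #pragma acc wait
    #pragma acc update host(err)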

Of course, we're open to suggestions!

- Mat

xray
Posted: Mon Mar 10, 2014 7:39 am

Dear Mat,
I see the point. But sometimes you, as a user, just miss a tiny detail in the specification and then you get confused.
Nevertheless, the reduction in combination with the async clause suggests a different behavior. Why can't you defer the copy-back to an acc wait? That is what I used anyway before updating. Or at least issue a warning during compilation...
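Just to sketch the behavior I would have expected (a proposal only, not something that works today):
Code:
    #pragma acc parallel loop present(...) async reduction(max:acc_err0)
    for( j = 1; j < j_cpu_start/2; j++ ) { /* ... */ }

    // host continues immediately; the kernel is still running

    #pragma acc wait   // proposed: the reduction result would be
                       // copied back here, when the wait completes,
                       // instead of blocking right after the region
    // only after the wait would acc_err0 be valid on the host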
Sandra