PGI User Forum

OpenACC vector length > 256 gives launch failure

 
xray
Joined: 21 Jan 2010
Posts: 85

Posted: Fri Feb 28, 2014 3:06 am    Post subject: OpenACC vector length > 256 gives launch failure

Hi,
I have a very simple Jacobi example. However, when I try different vector sizes (loop schedules), I get the error "call to cuStreamSynchronize returned error 700: Launch failed" for any vector length > 256. This happens with 14.2, 14.1, 13.10, and 13.9. With 13.6 the code works correctly.
I assume that is a compiler bug, isn't it? If so, I can send you the whole example if needed. If not, let me know what I am doing wrong. In any case, changing the loop schedule should not affect the output (as long as I stay in a valid thread number range).
BTW: I use an NVIDIA Quadro 6000 (Fermi) and the flags -acc -ta:nvidia,cuda5.0

Code:
  #pragma acc data copy(A[0:m*n]) create(Anew[0:m*n])
  {
    while ( err > tol && iter < iter_max ) {

        err = 0.0;

        #pragma acc parallel loop present(Anew[0:n*m],A[0:n*m]) reduction(max:err) vector_length(257)
        for( j = 1; j < n-1; j++) {
            for( i = 1; i < m-1; i++ ) {
                Anew[j *m+ i] = 0.25 * ( A[j     *m+ (i+1)] + A[j     *m+ (i-1)]
                                     +   A[(j-1) *m+ i]     + A[(j+1) *m+ i]);
                err = fmax(err,fabs(Anew[j*m+i]-A[j*m+i]));
            }
        }

        #pragma acc parallel loop present(Anew[0:n*m],A[0:n*m])
        for( j = 1; j < n-1; j++) {
            for( i = 1; i < m-1; i++ ) {
                A[j *m+ i] = Anew[j *m+ i];
            }
        }

        if(iter % 10 == 0) {
            printf("%5d, %0.6f\n", iter, err);
        }

        iter++;
    } // end while
  }


The code uses dynamically allocated arrays.
Sandra
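
For anyone who wants to try this without the full program, here is a minimal harness sketch around the two loops above. It is not the original program: the function name run_jacobi, the grid size, the boundary values (left edge fixed at 1.0), tol, and iter_max are all assumptions made for illustration. The max-error update is spelled out by hand instead of calling fmax/fabs so the sketch builds without libm.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical harness around the loops from the post. The grid size,
   boundary values, tol and iter_max are assumptions for illustration. */
int run_jacobi(int n, int m, double tol, int iter_max, double *final_err)
{
    /* Dynamically allocated, zero-initialized arrays, as in the post. */
    double *A    = calloc((size_t)n * m, sizeof *A);
    double *Anew = calloc((size_t)n * m, sizeof *Anew);
    if (A == NULL || Anew == NULL) { free(A); free(Anew); return -1; }

    /* Assumed boundary condition: left edge fixed at 1.0, the rest 0.0. */
    for (int j = 0; j < n; j++) A[j*m] = 1.0;

    double err = 1.0;
    int iter = 0;

    #pragma acc data copy(A[0:m*n]) create(Anew[0:m*n])
    {
        while (err > tol && iter < iter_max) {
            err = 0.0;

            /* vector_length(257) is the schedule reported to fail. */
            #pragma acc parallel loop present(Anew[0:n*m],A[0:n*m]) reduction(max:err) vector_length(257)
            for (int j = 1; j < n-1; j++) {
                for (int i = 1; i < m-1; i++) {
                    Anew[j*m+i] = 0.25 * (A[j*m+(i+1)] + A[j*m+(i-1)]
                                        + A[(j-1)*m+i] + A[(j+1)*m+i]);
                    /* Same as fmax(err, fabs(...)) for finite values. */
                    double d = Anew[j*m+i] - A[j*m+i];
                    if (d < 0.0) d = -d;
                    if (d > err) err = d;
                }
            }

            /* Copy the interior of the new iterate back into A. */
            #pragma acc parallel loop present(Anew[0:n*m],A[0:n*m])
            for (int j = 1; j < n-1; j++) {
                for (int i = 1; i < m-1; i++) {
                    A[j*m+i] = Anew[j*m+i];
                }
            }

            if (iter % 10 == 0) printf("%5d, %0.6f\n", iter, err);
            iter++;
        }
    }

    free(A);
    free(Anew);
    *final_err = err;
    return iter;   /* number of sweeps performed, or -1 on allocation failure */
}
```

Called as, say, run_jacobi(4096, 4096, 1e-6, 1000, &err) and built with the flags from the post, this should exercise the same schedule. A compiler without OpenACC support ignores the pragmas and runs the loops serially, which gives a reference result to compare against.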
mkcolg
Joined: 30 Jun 2004
Posts: 6208
Location: The Portland Group Inc.

Posted: Fri Feb 28, 2014 12:26 pm

Quote:
I assume that is a compiler bug, isn't it? If so, I can send you the whole example if needed.
It's possibly a compiler issue (maybe the reduction?) but it could be something else as well. Please do send me the program and I'll see what I can determine.

Thanks,
Mat
xray
Joined: 21 Jan 2010
Posts: 85

Posted: Thu Mar 06, 2014 1:12 am

I just did. Let me know if there is a workaround or a fix.
Thanks, Sandra
mkcolg
Joined: 30 Jun 2004
Posts: 6208
Location: The Portland Group Inc.

Posted: Thu Mar 06, 2014 11:50 am

Hi Sandra,

This looks like a known problem (TPR#18947), which occurs when a kernel uses more threads than the compiler-generated reduction kernel does. There's no workaround other than to either drop the reduction or limit the number of threads to 256.

I'll add your information to the TPR.

Thanks,
Mat
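
For reference, the second workaround (keep the reduction, cap the vector length at 256) could look roughly like the sketch below. The function name jacobi_sweep_capped is made up for illustration, and the sketch assumes A and Anew are already on the device through an enclosing data region like the one in the original post. It has not been checked against the affected PGI releases.

```c
/* One Jacobi sweep over the interior of an n-by-m grid.
   Returns max |Anew - A|, i.e. the value the original loop keeps in err.
   Assumes an enclosing `#pragma acc data` region has placed A and Anew
   on the device, as in the original post. */
double jacobi_sweep_capped(const double *A, double *Anew, int n, int m)
{
    double err = 0.0;

    /* vector_length capped at 256, the limit given in the reply above. */
    #pragma acc parallel loop present(Anew[0:n*m],A[0:n*m]) reduction(max:err) vector_length(256)
    for (int j = 1; j < n-1; j++) {
        for (int i = 1; i < m-1; i++) {
            Anew[j*m+i] = 0.25 * (A[j*m+(i+1)] + A[j*m+(i-1)]
                                + A[(j-1)*m+i] + A[(j+1)*m+i]);
            /* Same as fmax(err, fabs(...)) for finite values, spelled out
               so the sketch builds without linking libm. */
            double d = Anew[j*m+i] - A[j*m+i];
            if (d < 0.0) d = -d;
            if (d > err) err = d;
        }
    }
    return err;
}
```

Inside the while loop this would replace the first kernel, e.g. err = jacobi_sweep_capped(A, Anew, n, m); the copy-back loop stays as it is.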
All times are GMT - 7 Hours