PGI User Forum


error 702: Launch timeout happens non-deterministically

 
lechat



Joined: 14 Oct 2012
Posts: 10

Posted: Tue Oct 16, 2012 3:34 pm    Post subject: error 702: Launch timeout happens non-deterministically

(I'm using PGI 12.8 on Linux 64, with a GeForce GTX 280 and CUDA 4.1)

I'm doing some experiments with OpenACC, and this is puzzling me:

I had the following code to perform matrix multiplications:

Code:
typedef float ff;

void mmul(const ff* restrict a,
          const ff* restrict b,
          ff* restrict c,
          const int n) {
#pragma acc data copyin(a[0:n*n], b[0:n*n]) copy(c[0:n*n])
{

#pragma acc region
{

#pragma acc loop independent vector(16)
  for (int i = 0; i < n; ++i) {
#pragma acc loop independent vector(16)
    for (int j = 0; j < n; ++j) {
      ff sum = 0;
      for (int k = 0; k < n; ++k) {
        sum += a[i + n * k] * b[k + n * j];
      }
      c[i + n * j] = sum;
    }
  }

}
}
}
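
For reference, a minimal host driver along these lines is enough to exercise the routine; the initialization values and the spot-check below are illustrative, not the exact test harness:

Code:
#include <stdio.h>
#include <stdlib.h>

typedef float ff;
void mmul(const ff* restrict a, const ff* restrict b,
          ff* restrict c, const int n);

int main(void) {
  const int n = 1024;   /* the size used for the timings below */
  ff *a = malloc(sizeof(ff) * n * n);
  ff *b = malloc(sizeof(ff) * n * n);
  ff *c = malloc(sizeof(ff) * n * n);
  for (int i = 0; i < n * n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

  mmul(a, b, c, n);

  /* with this input every entry of c should equal 2*n */
  printf("c[0] = %f (expected %f)\n", c[0], 2.0f * n);
  free(a); free(b); free(c);
  return 0;
}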


This code runs well, but I'm looking to optimize it.
I then do a small transformation:

Code:
void mmul(const ff* restrict a,
          const ff* restrict b,
          ff* restrict c,
          const int n) {

#pragma acc data copyin(a[0:n*n], b[0:n*n]) copy(c[0:n*n])
{

#pragma acc region
{

  for (int is = 0; is < n; is += 32) {
#pragma acc loop independent
    for (int i = is; i < is+32; ++i) {
#pragma acc loop independent
      for (int j = 0; j < n; ++j) {
        ff sum = 0;
        for (int k = 0; k < n; ++k) {
          sum += a[i + n * k] * b[k + n * j];
        }
        c[i + n * j] = sum;
      }
    }
  }
}
}
}


I simply added an outer for loop; the iteration space remains basically the same.

While this isn't by itself an optimization, the result is very strange: about half of the time I run this code I get the following error:

Quote:
call to ctxSynchronize/after/__pgi_cu_uploadx returned error 702: Launch timeout


The other half of the time it simply runs, in about 8 seconds (for 1024x1024 matrices).
For smaller matrices it always works, so I suppose there might be a timeout issue here.

I'm not worried about performance here; I just want to understand this strange behaviour.
mkcolg



Joined: 30 Jun 2004
Posts: 6125
Location: The Portland Group Inc.

Posted: Tue Oct 16, 2012 4:23 pm

Hi lechat,

Let's look at the compiler feedback messages for these two loops:

Code:
mmul:
     11, Generating copyout(c[0:n*n])
         Generating copyin(b[0:n*n])
         Generating copyin(a[0:n*n])
     14, Generating present_or_copyout(c[0:n*n])
         Generating present_or_copyin(a[0:n*n])
         Generating present_or_copyin(b[0:n*n])
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     18, Loop is parallelizable
     20, Loop is parallelizable
         Accelerator kernel generated
         18, #pragma acc loop gang, vector(16) /* blockIdx.x threadIdx.x */
         20, #pragma acc loop gang, vector(16) /* blockIdx.y threadIdx.y */
             CC 1.0 : 20 registers; 64 shared, 8 constant, 0 local memory bytes
             CC 2.0 : 22 registers; 0 shared, 80 constant, 0 local memory bytes
     22, Loop is parallelizable
mmul2:
     38, Generating copyin(b[0:n*n])
         Generating copyin(a[0:n*n])
         Generating copy(c[0:n*n])
     41, Generating present_or_copy(c[0:n*n])
         Generating present_or_copyin(a[0:n*n])
         Generating present_or_copyin(b[0:n*n])
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     44, Complex loop carried dependence of '*(c)' prevents parallelization
         Loop carried dependence of '*(c)' prevents parallelization
         Loop carried backward dependence of '*(c)' prevents vectorization
         Complex loop carried dependence of '*(b)' prevents parallelization
         Complex loop carried dependence of '*(a)' prevents parallelization
         Accelerator kernel generated
         44, CC 1.0 : 20 registers; 64 shared, 8 constant, 0 local memory bytes
             CC 2.0 : 22 registers; 0 shared, 80 constant, 0 local memory bytes
         46, #pragma acc loop vector(32) /* threadIdx.x */
         Loop is parallelizable
     48, Loop is parallelizable
     50, Loop is parallelizable
main:
     89, Generating present_or_copyin(B[0:size][0:size])
         Generating present_or_copyin(A[0:size][0:size])

For the first version, you get a nice 2D gang (grid) with a 2D vector (thread block). For the second, however, the loop-carried dependence (the compiler can't prove independence of the computed array indices) means only a single gang with a single 1D vector is used. To fix this, add "independent" to the outer loop and add some schedule clauses:
Code:
#pragma acc region
{

#pragma acc loop independent gang
  for (int is = 0; is < n; is += 32) {
#pragma acc loop independent vector(32)
    for (int i = is; i < is+32; ++i) {
#pragma acc loop independent vector(16)
      for (int j = 0; j < n; ++j) {
        ff sum = 0;
        for (int k = 0; k < n; ++k) {
          sum += a[i + n * k] * b[k + n * j];
        }
        c[i + n * j] = sum;
      }
    }
  }
}

With this schedule the strip-mine loop maps to the gangs (grid) and the two inner loops map to the thread block, so you get a full 2D launch again.

Quote:

For smaller matrices it always works, so I suppose there might be a timeout issue here.
Most likely X is killing your run. On Linux, the X server's watchdog aborts any kernel that runs longer than roughly five seconds on a display-attached GPU, which would explain why small matrices always work. Is your GTX280 attached to a monitor?
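
You can also check from code whether the watchdog applies to a device. A minimal sketch using the CUDA runtime (compile with nvcc; it queries the standard kernelExecTimeoutEnabled device property):

Code:
#include <stdio.h>
#include <cuda_runtime.h>

/* Print, for every CUDA device, whether the kernel
   execution timeout (the X watchdog) is enabled on it. */
int main(void) {
  int count = 0;
  cudaGetDeviceCount(&count);
  for (int d = 0; d < count; ++d) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, d);
    printf("device %d (%s): kernel timeout %s\n",
           d, prop.name,
           prop.kernelExecTimeoutEnabled ? "ENABLED" : "disabled");
  }
  return 0;
}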

Hope this helps,
Mat
lechat



Joined: 14 Oct 2012
Posts: 10

Posted: Tue Oct 16, 2012 9:27 pm

Yes, it helps. Thanks Mat.
Neldan



Joined: 12 Feb 2013
Posts: 11

Posted: Thu Feb 14, 2013 4:20 am

I have the same problem, but my devices are not attached to a display.

My CUDA devices are:
- GTX 580
- GTX 460
- Tesla C2075

My code:

Code:
#pragma acc data copyin(m1[0:numFilas1][0:numColumnas1],m2[0:numFilas2][0:numColumnas2]), copyout(resultado[0:numFilas1][0:numFilas2])
{
  int i, j;

#pragma omp parallel for default(shared)
#pragma acc kernels
  for (i = 0; i < numFilas1; i++)
  {
#pragma omp parallel for
#pragma acc loop
    for (j = 0; j < numFilas2; j++)
    {
      int k = 0;
      real_t acumulador = 0;

      for (k = 0; k < numColumnas1; k++)
        acumulador += m1[i][k] * m2[j][k];
      resultado[i][j] = acumulador;
    }
  }
}



My code with your suggested changes:

Code:

#pragma acc data copyin(m1[0:numFilas1][0:numColumnas1],m2[0:numFilas2][0:numColumnas2]), copyout(resultado[0:numFilas1][0:numFilas2])
{
#pragma acc region
  {
    int i, j;

#pragma omp parallel for default(shared)
#pragma acc loop independent
    for (i = 0; i < numFilas1; i++)
    {
#pragma omp parallel for
#pragma acc loop independent
      for (j = 0; j < numFilas2; j++)
      {
        int k = 0;
        real_t acumulador = 0;

        for (k = 0; k < numColumnas1; k++)
          acumulador += m1[i][k] * m2[j][k];
        resultado[i][j] = acumulador;
      }
    }
  }
}


On the GTX 580 and the 460 the execution fails with the timeout; only on the Tesla does it finish, in about 40 seconds.

The test matrices I used are 5000x5000.
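
One way to test the watchdog theory on a multi-GPU machine is to pin the run to the Tesla, either by setting the ACC_DEVICE_NUM environment variable or from code via the OpenACC runtime. A minimal sketch; the device number below is a guess, so check the ordering reported by pgaccelinfo first:

Code:
#include <openacc.h>

int main(void) {
  /* Assumed device numbering -- verify with pgaccelinfo.
     Running on the Tesla sidesteps the X watchdog. */
  acc_set_device_num(2, acc_device_nvidia);

  /* ... run the matrix multiply as before ... */
  return 0;
}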