PGI User Forum


EventSynchronize returned error 700
WENYANG LIU



Joined: 26 Sep 2010
Posts: 11

Posted: Mon Oct 25, 2010 8:33 am    Post subject: EventSynchronize returned error 700

Hi,

I have an example code here:

Code:

      program testsum

      implicit none
      integer::i,j,k,lj,rj
      integer::a(10,3),suma(5,3)
     
!$acc data region local(a,suma)
      do k=1,500

!$acc region
         do i=1,10
            a(i,1)=i
            a(i,2)=i*2
            a(i,3)=i*3
         end do
!$acc end region

!$acc region
         do j=1,5
            lj=(j-1)*2 + 1
            rj=j*2
            suma(j,1)=sum(a(lj:rj,1))
            suma(j,2)=sum(a(lj:rj,2))
            suma(j,3)=sum(a(lj:rj,3))
         end do
!$acc end region

      end do
!$acc updateout(suma)
!$acc end data region

      write(*,*)suma(:,1)
      end program


It fails to run:
Code:

call to EventSynchronize returned error 700: Launch failed
CUDA driver version: 2030

Accelerator Kernel Timing data
/mnt/home/../test.f
  testsum
    18: region entered 138 times
        time(us): total=4499 init=14 region=4485
                  kernels=1499 data=0
        w/o init: total=4485 max=136 min=31 avg=32
        19: kernel launched 138 times
            grid: [1]  block: [5]
            time(us): total=1499 max=12 min=10 avg=10
/mnt/home/../test.f
  testsum
    10: region entered 138 times
        time(us): total=4142 init=19 region=4123
                  kernels=1109 data=0
        w/o init: total=4123 max=134 min=28 avg=29
        11: kernel launched 138 times
            grid: [1]  block: [10]
            time(us): total=1109 max=13 min=7 avg=8
/mnt/home/../test.f
  testsum
    7: region entered 1 time
        time(us): init=696724



Can someone explain this error?

Thanks.
mkcolg



Joined: 30 Jun 2004
Posts: 5815
Location: The Portland Group Inc.

Posted: Tue Oct 26, 2010 4:20 pm

Hi WENYANG LIU,

Thanks for the report. I'll need to pass this one on to our compiler engineers for further investigation (I added TPR#17300).

In the meantime, one workaround is to put each sum in its own loop. For example:
Code:
% cat test2.f90

      program testsum

      implicit none
      integer::i,j,k,lj,rj
      integer::a(10,3),suma(5,3)

      suma=0
!$acc data region local(a,suma)
      do k=1,50

!$acc region
         do i=1,10
            a(i,1)=i
            a(i,2)=i*2
            a(i,3)=i*3
         end do
!$acc end region

!$acc region
         do j=1,5
            lj=(j-1)*2 + 1
            rj=j*2
            suma(j,1)=sum(a(lj:rj,1))
         end do
!$acc end region

!$acc region
         do j=1,5
            lj=(j-1)*2 + 1
            rj=j*2
            suma(j,2)=sum(a(lj:rj,2))
         end do
!$acc end region

!$acc region
         do j=1,5
            lj=(j-1)*2 + 1
            rj=j*2
            suma(j,3)=sum(a(lj:rj,3))
         end do
!$acc end region

      end do
!$acc updateout(suma)
!$acc end data region

      write(*,*)suma(:,1)
      write(*,*)suma(:,2)
      write(*,*)suma(:,3)
      end program
% pgf90 -ta=nvidia -Minfo=accel test2.f90 -V10.9 ; a.out
testsum:
      9, Generating local(suma(:,:))
         Generating local(a(:,:))
     12, Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     13, Loop is parallelizable
         Accelerator kernel generated
         13, !$acc do parallel, vector(10)
             CC 1.0 : 6 registers; 20 shared, 16 constant, 0 local memory bytes; 33 occupancy
             CC 1.3 : 6 registers; 20 shared, 16 constant, 0 local memory bytes; 25 occupancy
     20, Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     21, Loop is parallelizable
         Accelerator kernel generated
         21, !$acc do parallel, vector(5)
             CC 1.0 : 6 registers; 20 shared, 28 constant, 0 local memory bytes; 33 occupancy
             CC 1.3 : 6 registers; 20 shared, 28 constant, 0 local memory bytes; 25 occupancy
     24, Loop is parallelizable
     28, Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     29, Loop is parallelizable
         Accelerator kernel generated
         29, !$acc do parallel, vector(5)
             CC 1.0 : 6 registers; 20 shared, 28 constant, 0 local memory bytes; 33 occupancy
             CC 1.3 : 6 registers; 20 shared, 28 constant, 0 local memory bytes; 25 occupancy
     32, Loop is parallelizable
     36, Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     37, Loop is parallelizable
         Accelerator kernel generated
         37, !$acc do parallel, vector(5)
             CC 1.0 : 6 registers; 20 shared, 28 constant, 0 local memory bytes; 33 occupancy
             CC 1.3 : 6 registers; 20 shared, 28 constant, 0 local memory bytes; 25 occupancy
     40, Loop is parallelizable
     45, Generating !$acc update host(suma(:,:))
            3            7           11           15           19
            6           14           22           30           38
            9           21           33           45           57


Best Regards,
Mat
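
The three output rows above can be checked by hand. Here is a minimal host-side sketch in C, not part of the original posts, that recomputes each entry from the test case's definitions: a(i,c) = i*c and suma(j,c) = a(2j-1,c) + a(2j,c). The function name expected_suma is hypothetical.

```c
/* Host-side reference for the thread's test case (hypothetical helper).
   a(i,c) = i*c for i = 1..10 and c = 1..3;
   suma(j,c) sums rows lj..rj of column c, using the same 1-based
   bounds as the Fortran loop. */
static int expected_suma(int j, int c)
{
    int lj = (j - 1) * 2 + 1;  /* first row of the pair */
    int rj = j * 2;            /* last row of the pair */
    int total = 0;

    for (int i = lj; i <= rj; i++)
        total += i * c;        /* a(i,c) = i*c */
    return total;
}
```

Comparing the accelerator output against values like these is a quick way to confirm that splitting the sums into separate regions did not change the result.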
sameer.asal



Joined: 13 Dec 2011
Posts: 1

Posted: Tue Dec 13, 2011 6:37 am

Hello,

I am getting a similar problem for this code:

Code:
   

#define M     256*4
#define N     256*4
#define K     256*4

int t1, t2, t3;
   #pragma acc region copyin( PGI_OPT_C[0:N-1 ][0:N-1], PGI_OPT_B[0:N-1 ][0:N-1], PGI_OPT_A[0:N-1 ][0:N-1] )//, copyout( PGI_OPT_C[ 0:N*N-1  ] )
   {
     for (t1=0;t1<=N-1;t1++)
     {
       for (t2=0;t2<=N-1;t2++)
       {
         for (t3=0;t3<=N-1;t3++)
         {
       PGI_OPT_C[ t1*N + t2]= PGI_OPT_C[ t1*N + t2 ] + PGI_OPT_A[ t1*N + t3 ]*PGI_OPT_B[ t3*N + t2];
       //PGI_OPT_C[ t1][ t2] = PGI_OPT_C[ t1][ t2 ] + PGI_OPT_A[ t1 ][ t3 ]*PGI_OPT_B[ t3][ t2 ];
      //PGI_OPT_C[ t1 ] = PGI_OPT_C[ t1 ]*2 ;
         }
       }
     }   
   }


Oddly enough, this only happens when M, N, and K are defined with sizes greater than 1024.

Also, if I set the size to 256*256, I get a compilation error:
Code:
PGC-F-0000-Internal compiler error. pgnvd job exited with nonzero status code       0 (main.c: 168)


Any thoughts on this?
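
For reference, the index arithmetic at the two sizes mentioned can be worked out on the host. This is a minimal sketch in C, not from the original post; the names DIM_SMALL, DIM_LARGE, flat_elems and fits_in_int are hypothetical. It only shows that a flat N*N array with N = 256*256 has more elements than a 32-bit int index can address. Whether that is what triggers the internal compiler error is not established in this thread.

```c
#include <limits.h>
#include <stdint.h>

/* Parenthesized so the macros expand safely inside larger expressions
   (the post's "#define N 256*4" has no parentheses). */
#define DIM_SMALL (256 * 4)    /* 1024 */
#define DIM_LARGE (256 * 256)  /* 65536 */

/* Number of elements in a flat dim x dim array, computed in 64 bits. */
static int64_t flat_elems(int64_t dim)
{
    return dim * dim;
}

/* Nonzero when the largest flat index, dim*dim - 1, fits in an int,
   which is the type of t1, t2 and t3 in the posted loop nest. */
static int fits_in_int(int64_t dim)
{
    return flat_elems(dim) - 1 <= (int64_t)INT_MAX;
}
```

With N = 256*4 the largest flat index is 1048575. With N = 256*256 the element count is 4294967296, so t1*N + t2 would overflow a 32-bit int before any device code runs.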
mkcolg



Joined: 30 Jun 2004
Posts: 5815
Location: The Portland Group Inc.

Posted: Wed Dec 14, 2011 9:30 am

Hi sameer.asal,

Can you please post, or send to PGI Customer Service (trs@pgroup.com), a complete reproducing example? I don't see anything unusual about your code snippet, so I will need the complete example to determine the problem. It would also help to know what type of device you are using and the compiler version.

"700" is a generic error and is most likely unrelated to Wenyang Liu's issue. I've seen it most often when there is a seg fault while copying data to the device. Check that your array bounds match the copy clause.

- Mat
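
A minimal sketch of the bounds check suggested above, in C and not taken from the thread. It assumes the loop nest indexes a flat array as a[t1*n + t2] with t1 and t2 in 0..n-1, and that the copy clause gives an inclusive lower:upper range, as the commented-out 0:N*N-1 in the posted pragma suggests. The helper names max_flat_index and range_covers are hypothetical.

```c
#include <assert.h>

/* Highest flat index touched by a loop nest that accesses a[t1*n + t2]
   for t1 and t2 in 0..n-1. */
static long max_flat_index(long n)
{
    assert(n > 0);
    return (n - 1) * n + (n - 1);
}

/* Nonzero when an inclusive 1-D range [lo:hi] covers indices 0..max_idx. */
static int range_covers(long lo, long hi, long max_idx)
{
    return lo <= 0 && max_idx <= hi;
}
```

For n = 1024 the loop touches indices up to 1048575, so a one-dimensional range of 0:N*N-1 covers it, while a range that stops at N-1 does not.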
sunitachandra



Joined: 26 Mar 2012
Posts: 2

Posted: Tue Jul 03, 2012 9:29 am    Post subject: call to EventSynchronize returned error 700: Launch failed

Hi

Here is sample code for a Jacobi iteration. It compiles fine but fails at run time. I am using PGI 12.3 and have also tried 12.4. n and m are 512.

Code snippet:
#pragma omp parallel for shared(Anew)
for (int i = 1; i < m; i++)
{
    Anew[0][i] = 0.f;
    Anew[n-1][i] = 0.f;
}
#pragma omp parallel for shared(Anew)
for (int j = 1; j < n; j++)
{
    Anew[j][0] = y0[j];
    Anew[j][m-1] = y0[j]*expf(-pi);
}

while ( error > tol && iter < iter_max )
{
    error = 0.f;

#pragma omp parallel for shared(m, n, Anew, A)
#pragma acc kernels
    for( int j = 1; j < n-1; j++)
    {
        for( int i = 1; i < m-1; i++ )
        {
            Anew[j][i] = 0.25f * ( A[j][i+1] + A[j][i-1]
                                 + A[j-1][i] + A[j+1][i]);
            error = fmaxf( error, fabsf(Anew[j][i]-A[j][i]));
        }
    }

#pragma omp parallel for shared(m, n, Anew, A)
#pragma acc kernels
    for( int j = 1; j < n-1; j++)
    {
        for( int i = 1; i < m-1; i++ )
        {
            A[j][i] = Anew[j][i];
        }
    }

On execution:
./laplace2d_acc
Jacobi relaxation Calculation: 4096 x 4096 mesh
call to EventSynchronize returned error 700: Launch failed
CUDA driver version: 4020

Accelerator Kernel Timing data
/home/sunita/tests/openacc/parallel-forall-code-samples-78e5f9b/posts/002-openacc-example/step1/laplace2d.c
main
74: region entered 1 time
time(us): init=119031
data=29051
If I add
#pragma acc data copy(A, Anew)
before
while ( error > tol && iter < iter_max )
{
    error = 0.f;

then I get the following error:

Jacobi relaxation Calculation: 4096 x 4096 mesh
call to cuMemcpyDtoH returned error 700: Launch failed
CUDA driver version: 4020

Please let me know if there is a problem in the code.

Sunita
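
One way to separate a logic problem from an accelerator problem is to run the same sweep on the host and compare. Here is a minimal host-only sketch in C, not part of the original post. It assumes a small fixed mesh (REF_N and REF_M are hypothetical sizes, much smaller than the mesh in the thread) and contains no OpenMP or OpenACC directives. The helper names are hypothetical too.

```c
/* Hypothetical small mesh for a host-side check. */
#define REF_N 8
#define REF_M 8

/* One Jacobi sweep on the host: fills the interior of Anew from A and
   returns the largest absolute change, i.e. the max reduction that the
   posted loop accumulates in "error". */
static float jacobi_sweep_ref(float A[REF_N][REF_M], float Anew[REF_N][REF_M])
{
    float error = 0.0f;

    for (int j = 1; j < REF_N - 1; j++) {
        for (int i = 1; i < REF_M - 1; i++) {
            Anew[j][i] = 0.25f * (A[j][i + 1] + A[j][i - 1]
                                + A[j - 1][i] + A[j + 1][i]);

            /* Absolute difference and running maximum, written out
               by hand so the sketch needs no math library. */
            float diff = Anew[j][i] - A[j][i];
            if (diff < 0.0f)
                diff = -diff;
            if (diff > error)
                error = diff;
        }
    }
    return error;
}

/* Copy the interior of Anew back into A, as the second loop nest does. */
static void jacobi_copy_back_ref(float A[REF_N][REF_M], float Anew[REF_N][REF_M])
{
    for (int j = 1; j < REF_N - 1; j++)
        for (int i = 1; i < REF_M - 1; i++)
            A[j][i] = Anew[j][i];
}
```

If the host version runs cleanly at the real mesh size, the next things to check are the array extents seen by the accelerator region and the data clause, along the lines of the bounds advice earlier in the thread.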
All times are GMT - 7 Hours