PGI User Forum


cudaEventSynchronize returned status 4: unspecified

 
sslgamess



Joined: 23 Nov 2009
Posts: 35

Posted: Tue Dec 20, 2011 4:17 pm    Post subject: cudaEventSynchronize returned status 4: unspecified

Hello,

I'm compiling the toy code below with the flags

-Mcuda -ta=nvidia,wait,time

If natoms is greater than 472 I get the following error message:

line 23: cudaEventSynchronize returned status 4: unspecified launch failure

Am I exceeding a resource limitation on the Tesla C2075?

- Sarom

Code:

      program test
     
      implicit none
     
      integer irad, iang, iatom, jatom, igridpoint
      integer, parameter :: natoms = 473
      integer, parameter :: nradpt = 96
      integer, parameter :: nlebpt = 1202
     
      double precision, dimension(natoms) :: tempa
      double precision, dimension(natoms,natoms) :: tempb
      double precision, dimension(natoms,nradpt*nlebpt) :: tempc
      double precision, dimension(nradpt*nlebpt) :: tempd
      double precision valuea,valueb,valuec,valued,valuee
     
      do irad=1,nradpt
!$acc data region
!$acc> copyout(tempc(1:natoms,(irad-1)*nlebpt+1:(irad-1)*nlebpt+nlebpt))
!$acc region
!$acc do parallel, private(tempa(1:natoms),tempb(1:natoms,1:natoms))
        do iang=1,nlebpt
          igridpoint=(irad-1)*nlebpt+iang
          do iatom=1,natoms
            tempa(iatom)=1.0D+00
            do jatom=1,natoms
              tempb(jatom,iatom)=1.0D+00
              if (iatom .eq. jatom) cycle
              valuea=dble(jatom)/(22.0D+00/7.00D+00)
              valueb=sin(valuea)
              valuec=dble(iatom)/(22.0D+00/7.00D+00)
              valued=cos(valuec)
              valuee=abs(atan(valued/valueb))
              tempb(jatom,iatom)=valuee**4.0D+00
            enddo
            tempa(iatom)=product(tempb(1:natoms,iatom),1)
          enddo
          tempc(1:natoms,igridpoint)=tempa(1:natoms)
        enddo
!$acc end region
!$acc end data region
      enddo
     
      do irad=1,nradpt
        do iang=1,nlebpt
          igridpoint=(irad-1)*nlebpt+iang
          tempd(igridpoint)=sum(tempc(1:natoms,igridpoint),1)*
     >    dble(igridpoint)/dble(nlebpt*nradpt)
        enddo
      enddo
     
      write(*,*) 'SUM:= ', sum(tempd(1:nradpt*nlebpt),1)
      end
 


Build output

Code:

------ Build started: Project: test-loop-b, Configuration: Debug x64 ------
Compiling Project  ...
test-b.f
test:
     17, Generating copyout(tempc(:,(irad-1)*1202+1:(irad-1)*1202+1202))
     21, Loop is parallelizable
     23, Loop is parallelizable
         Accelerator kernel generated
         21, !$acc do parallel ! blockidx%y
             Using register for 'tempa'
         23, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
     25, Loop is parallelizable
     35, product reduction inlined
         Loop is parallelizable
     37, Loop is parallelizable
         Accelerator kernel generated
         21, !$acc do parallel ! blockidx%y
         37, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
     46, sum reduction inlined
     51, sum reduction inlined
Linking...
test-loop-b build succeeded.

Build log was saved at "file://C:\gamessVS\11.28.2011\test\test-loop-b\x64\Debug\BuildLog.htm"
mkcolg



Joined: 30 Jun 2004
Posts: 5871
Location: The Portland Group Inc.

Posted: Thu Dec 22, 2011 2:36 pm

Hi Sarom,

Private arrays are allocated in one large chunk of memory on the device, with each thread getting its own section. With 1202 threads, each holding 473x473 elements at 8 bytes per element, the total memory is just over 2GB (2,151,378,064 bytes). When natoms is 472, the total is just under 2GB (2,142,290,944 bytes).

Looking at the generated GPU code, we're using "int" data types to calculate the address of tempb, so when the size goes over 2GB, the index overflows the 32-bit data type. I've asked our engineers to take a look at what can be done. Most likely we'll need to add a flag similar to "-Mlarge_arrays" for the GPU.
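
The size arithmetic above can be checked with a short standalone program. This is only a sketch: it assumes the offset is computed in bytes (as the reply suggests), that each of the 1202 iang iterations gets its own natoms x natoms copy of tempb, and that double precision is 8 bytes. The program and variable names are illustrative and are not taken from the compiler's generated code. It is written in free-form Fortran, unlike the fixed-form listings in this thread.

```fortran
program private_size_check
  use, intrinsic :: iso_fortran_env, only: int32, int64
  implicit none

  ! Sizes taken from the thread; bytes_per_elem assumes 8-byte double precision.
  integer(int64), parameter :: nlebpt = 1202_int64
  integer(int64), parameter :: bytes_per_elem = 8_int64
  integer(int64), parameter :: int32_max = int(huge(0_int32), int64)
  integer(int64) :: natoms, total_bytes

  do natoms = 472_int64, 473_int64
    ! One private natoms x natoms copy of tempb per iteration of the iang loop.
    total_bytes = nlebpt * natoms * natoms * bytes_per_elem
    write(*,'(a,i0,a,i0,a,l1)') 'natoms=', natoms, ' bytes=', total_bytes, &
      ' fits_in_int32=', total_bytes <= int32_max
  end do
end program private_size_check
```

The largest signed 32-bit value is 2,147,483,647, so the 472 case fits and the 473 case does not, which matches the threshold reported in the first post.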

The workaround is to use smaller arrays, a smaller data type, or to manually privatize your arrays, as in the code below, where tempa and tempb gain a leading nlebpt dimension indexed by iang. We use a slightly different way to calculate addresses on global arrays that happens to work in this case. It's not an ideal solution, though, since it wastes a lot of host memory and requires the medium memory model.

Code:
 % cat acc_12_21_11.f

      program test
     
      implicit none
     
      integer irad, iang, iatom, jatom, igridpoint
      integer, parameter :: natoms = 473
      integer, parameter :: nradpt = 96
      integer, parameter :: nlebpt = 1202
     
      double precision, dimension(nlebpt,natoms) :: tempa
      double precision, dimension(nlebpt,natoms,natoms) :: tempb
      double precision, dimension(natoms,nradpt*nlebpt) :: tempc
      double precision, dimension(nradpt*nlebpt) :: tempd
      double precision valuea,valueb,valuec,valued,valuee
     
      do irad=1,nradpt
!$acc data region
!$acc> copyout(tempc(1:natoms,(irad-1)*nlebpt+1:(irad-1)*nlebpt+nlebpt))
!$acc region local(tempb,tempa)
!$acc do parallel
!, private(tempa(1:natoms),tempb(1:natoms,1:natoms))
        do iang=1,nlebpt
          igridpoint=(irad-1)*nlebpt+iang
          do iatom=1,natoms
            tempa(iang,iatom)=1.0D+00
            do jatom=1,natoms
              tempb(iang,jatom,iatom)=1.0D+00
              if (iatom .eq. jatom) cycle
              valuea=dble(jatom)/(22.0D+00/7.00D+00)
              valueb=sin(valuea)
              valuec=dble(iatom)/(22.0D+00/7.00D+00)
              valued=cos(valuec)
              valuee=abs(atan(valued/valueb))
              tempb(iang,jatom,iatom)=valuee**4.0D+00
            enddo
            tempa(iang,iatom)=product(tempb(iang,1:natoms,iatom),1)
          enddo
          tempc(1:natoms,igridpoint)=tempa(iang,1:natoms)
        enddo
!$acc end region
!$acc end data region
      enddo
     
      do irad=1,nradpt
        do iang=1,nlebpt
          igridpoint=(irad-1)*nlebpt+iang
          tempd(igridpoint)=sum(tempc(1:natoms,igridpoint),1)*
     >    dble(igridpoint)/dble(nlebpt*nradpt)
        enddo
      enddo
     
      write(*,*) 'SUM:= ', sum(tempd(1:nradpt*nlebpt),1)
      end
 

% pgf90 -ta=nvidia -Mcuda -mcmodel=medium -fast acc_12_21_11.f
% a.out
 SUM:=     130948983393.7319   
%


- Mat
sslgamess



Joined: 23 Nov 2009
Posts: 35

Posted: Thu Dec 22, 2011 7:53 pm

Thanks for the response Mat.

Just so I am clear on this. The use of 'int' data types to calculate the address of tempb is outside of my control? The use of an -i8 flag would have no effect?

And how would I get your workaround to work on Win64?
mkcolg



Joined: 30 Jun 2004
Posts: 5871
Location: The Portland Group Inc.

Posted: Tue Dec 27, 2011 11:47 am

Quote:
Just so I am clear on this. The use of 'int' data types to calculate the address of tempb is outside of my control? The use of an -i8 flag would have no effect?

I believe so. Then again, this is just my analysis and I may be wrong. I've asked Michael and his team to investigate once they are back from Winter Break.

Quote:
And how would I get your workaround to work on Win64?

You would need to dynamically allocate the arrays. Windows doesn't support large static arrays, only dynamic ones.

- Mat
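
A minimal sketch of what dynamic allocation of the two workaround arrays might look like, with the shapes taken from the workaround listing above. The program name and the stat check are illustrative additions. The accelerator region itself is left out, and whether the local clause behaves the same way with allocatable arrays on Win64 is an assumption that this thread does not confirm.

```fortran
program alloc_sketch
  implicit none

  integer, parameter :: natoms = 473
  integer, parameter :: nlebpt = 1202
  integer :: ierr

  ! Same shapes as the static arrays in the workaround, but allocated at run time.
  double precision, allocatable :: tempa(:,:), tempb(:,:,:)

  ! tempb alone needs about 2.15 GB, so check that the allocation succeeded.
  allocate(tempa(nlebpt,natoms), tempb(nlebpt,natoms,natoms), stat=ierr)
  if (ierr /= 0) then
    write(*,*) 'allocation failed, stat = ', ierr
    stop 1
  end if

  ! ... the accelerator region and loops from the workaround would go here ...

  deallocate(tempa, tempb)
end program alloc_sketch
```

Because the arrays live on the heap rather than in static data, this form avoids the static array size limit mentioned in the reply, at the cost of the same large host memory footprint as the static version.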
sslgamess



Joined: 23 Nov 2009
Posts: 35

Posted: Sat Mar 24, 2012 3:38 am

Hi Mat,

Have the engineers added a compiler flag that allows privatized arrays to be indexed beyond 2GB?
All times are GMT - 7 Hours