PGI User Forum

cuMemAlloc Error

 
Pebbles



Joined: 24 Sep 2010
Posts: 13

Posted: Fri Dec 17, 2010 1:36 pm    Post subject: cuMemAlloc Error

Hi,

I am using the PGI Accelerator with Fortran directives to accelerate the following code snippet. When I execute the code, I get a cuMemAlloc error. The arrays do not appear to be that large, so I am wondering why I am getting this error.

Thanks for the help,
Karan

=================== Code Sample =====================

PROGRAM SAMPLE

!**** PARAMETER Declarations

INTEGER, PARAMETER :: NB= 64, NW = 224, NX=753, NY = 500, KMAX=NX*NY

!**** INTRINSIC Declarations

INTRINSIC SQRT,EXP ! In GPU processing section

!**** Local Variable Declarations

INTEGER :: I,J,K,L,IX,IY,IW,NY,NW,IB,DUMA(0:NB,2),IE,IVS,IV,KK,
INTEGER (KIND=KI4), ALLOCATABLE :: IDUMA(:,:)
REAL :: SIGMA,WL,WT,CL,FRACT,SUM,BL,XYZ_M,XYZ
REAL , ALLOCATABLE :: B(:,:,:),C(:,:),W0(:),DW(:),V_MS(:,:,:),RD(:,:,:,:)

!**** Begin numerical processing

! This DO loop seg faults if accelerated. Something with WL?

DO IW = 1,NW
DO IB = 1,NB
C(IW,IB)=0.0
END DO
SUM=0.0
DO IV = 1,NVS
WL=1.E+04/V_MS(IV,IVS,1)
SIGMA=0.001*DW(IW)/FACTOR
WT=EXP(-(W-0.001*W0(IW))**2/(2.*SIGMA**2))
DO IB = 1,NB
C(IW,IB)=C(IW,IB)+RD(IV,IVS,1,IB)*WT
END DO
SUM=SUM+WT
END DO
DO IB = 1,NB
C(IW,IB)=C(IW,IB)/SUM
END DO
END DO

!$acc region
!$acc private(C)
314 DO IB = 1,NB
CL=0.0
316 DO IW = 1,NW
CL=CL+C(IW,IB)**2
END DO
IF (CL<EPSMIN4) CL=1.0
CL=SQRT(CL)
321 DO IW = 1,NW
C(IW,IB)=C(IW,IB)/CL
END DO
END DO
!$acc end region

!$acc region
!$acc do private(IDUMA, B)
334 DO K = 1,KMAX
KK=(K-1)/NX
IY= KK+1
IX=K-KK*NX

BL=0.0
342 DO IW = 1,NW
IF (B(IX,IY,IW)>0.0) BL=BL+B(IX,IY,IW)**2
END DO
IF (BL<EPSMIN4) BL=1.0
BL=SQRT(BL)
347 DO IW = 1,NW
B(IX,IY,IW)=B(IX,IY,IW)/BL
END DO

IDUMA(IX,IY)=0
XYZ_M=0.0
355 DO IB = 1,NB
IE=DUMA(IB,1)
XYZ=0.0
358 DO IW = 1,NW
XYZ=XYZ+B(IX,IY,IW)*C(IW,IB)
END DO

IF (XYZ>XYZ_M.AND.IE/=39.AND.IE/=49) THEN
! IF (XYZ>XYZ_M) THEN
XYZ_M=XYZ
IDUMA(IX,IY)=DUMA(IB,1)
END IF
END DO
END DO
!$acc end region

!$acc region
!$acc do private(DUMA)
376 DO K = 1,KMAX
KK=(K-1)/NX
IY=KK+1
IX=K-KK*NX
IF (IDUMA(IX,IY)==0) DUMA(0,2)=DUMA(0,2)+1
381 DO IB = 1,NB
IF (IDUMA(IX,IY)==DUMA(IB,1)) DUMA(IB,2)=DUMA(IB,2)+1
END DO
END DO
!$acc end region

!**** End of numeric processing

STOP 'Normal termination'
END PROGRAM SAMPLE

===================Start Compiler Output=================

257, Invariant assignments hoisted out of loop
286, Invariant if transformation
291, Invariant assignments hoisted out of loop
312, Generating compute capability 1.0 binary
Generating compute capability 1.3 binary
314, Loop is parallelizable
Accelerator kernel generated
314, !$acc do parallel, vector(64)
CC 1.0 : 18 registers; 20 shared, 100 constant, 0 local memory bytes; 50 occupancy
CC 1.3 : 17 registers; 20 shared, 100 constant, 0 local memory bytes; 50 occupancy
316, Loop is parallelizable
321, Loop is parallelizable
332, Generating copyin(model_radiance(1:nwl,1:64))
Generating copyin(end_member(1:64,1))
Generating compute capability 1.0 binary
Generating compute capability 1.3 binary
334, Loop is parallelizable
Accelerator kernel generated
334, !$acc do parallel, vector(256)
CC 1.0 : 28 registers; 20 shared, 180 constant, 0 local memory bytes; 33 occupancy
CC 1.3 : 28 registers; 20 shared, 180 constant, 0 local memory bytes; 50 occupancy
342, Loop is parallelizable
347, Loop is parallelizable
355, Loop carried scalar dependence for 'dotpr_matl' at line 364
Complex loop carried dependence of 'iend_member' prevents parallelization
Loop carried reuse of 'iend_member' prevents parallelization
358, Loop is parallelizable
374, Generating copyin(iend_member(:,:))
Generating compute capability 1.0 binary
Generating compute capability 1.3 binary
376, Loop is parallelizable
Accelerator kernel generated
376, !$acc do parallel, vector(32)
Non-stride-1 accesses for array 'end_member'
CC 1.0 : 9 registers; 20 shared, 84 constant, 0 local memory bytes; 33 occupancy
CC 1.3 : 9 registers; 20 shared, 84 constant, 0 local memory bytes; 25 occupancy
381, Loop is parallelizable
402, Invariant assignments hoisted out of loop
====================Execution Output ===============
Number of data points:
Number of samples = 200
Number of lines = 200
Number of bands = 224
Number of end-members = 64
launch kernel file=/home/users/elliott/HypGP/HypGP.f95 function=hypgp line=314 device=0 grid=1 block=64
call to cuMemAlloc returned error 2: Out of memory
CUDA driver version: 3010
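
Since the question is whether the arrays are large enough to exhaust device memory, a rough size estimate can help. The following is a minimal sketch, not part of the original program: it assumes the shapes B(NX,NY,NW), C(NW,NB) and IDUMA(NX,NY) implied by the indexing in the sample, 4-byte default REAL and INTEGER, and the dimensions from the PARAMETER statement. The last line is a hypothetical worst case in which a private clause gives every iteration of the K loop its own copy of B.

```fortran
program footprint
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  ! Dimensions copied from the PARAMETER statement in the post.
  integer(int64), parameter :: nb = 64, nw = 224, nx = 753, ny = 500
  integer(int64), parameter :: kmax = nx*ny
  ! Assumption: default REAL and INTEGER occupy 4 bytes each.
  integer(int64), parameter :: elem_bytes = 4
  integer(int64) :: b_bytes, c_bytes, iduma_bytes

  ! Assumed shapes, inferred from how the arrays are indexed in the sample.
  b_bytes     = nx*ny*nw*elem_bytes   ! B(NX,NY,NW)
  c_bytes     = nw*nb*elem_bytes      ! C(NW,NB)
  iduma_bytes = nx*ny*elem_bytes      ! IDUMA(NX,NY)

  print '(a,i0)', 'B bytes, one copy:     ', b_bytes
  print '(a,i0)', 'C bytes, one copy:     ', c_bytes
  print '(a,i0)', 'IDUMA bytes, one copy: ', iduma_bytes
  ! Hypothetical worst case: one private copy of B per K iteration.
  print '(a,i0)', 'B bytes, KMAX copies:  ', b_bytes*kmax
end program footprint
```

Under these assumptions a single copy of B is about 337 MB, so any per-iteration or per-thread replication of it would go far beyond the memory of a single device. Whether the compiler actually replicates the array this way depends on how it treats the private clause.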
mkcolg



Joined: 30 Jun 2004
Posts: 6218
Location: The Portland Group Inc.

Posted: Fri Dec 17, 2010 3:18 pm    Post subject: cuMemAlloc Error

Hi Karan,

My best guess is that this isn't the same program that gets the cuMemAlloc error, but rather just a portion of it. (The posted source has a number of semantic and syntax errors, and it never allocates its arrays, which causes seg faults.) Can you send the full source to PGI customer service (trs@pgroup.com) or post a reproducing example?

- Mat
Pebbles



Joined: 24 Sep 2010
Posts: 13

Posted: Mon Dec 20, 2010 6:45 am    Post subject: cuMemAlloc Error

Hi Mat,

Unfortunately, the code is proprietary so I can't post it. I tried to extract the region that had been accelerated and provide the dimensions for the arrays hoping that would give a clue as to the reason for the memory problems. Since the code is fairly complex, I am not sure I could provide a sample working abstraction that would replicate the problems I am seeing.

Are there any debugging tools or ways to monitor the GPU to see what is going on while the kernel code is executing?

Thanks for your help,
Karan
mkcolg



Joined: 30 Jun 2004
Posts: 6218
Location: The Portland Group Inc.

Posted: Mon Dec 20, 2010 6:40 pm    Post subject: cuMemAlloc Error

Hi Karan,

Our debugger doesn't run on GPU code, but given that the GPU code is auto-generated, low-level CUDA C, you probably wouldn't want to try to debug it.

The error "call to cuMemAlloc returned error 2: Out of memory" means that your program attempted to allocate too much memory on the device. This could simply mean that your device's memory is too small for your problem, or it could mean that your program has a bug where it allocates more memory than it needs (such as using an uninitialized variable as the size of an allocatable array). Try compiling and running your program without the accelerator flag, and instead compile with "-g -Mbounds -Mchkptr". Also, if you are running on Linux, the Valgrind utility is very good at finding memory problems (www.valgrind.org).
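
As an illustration of guarding against the uninitialized-size bug described above, here is a minimal host-side sketch in standard Fortran. It is an assumption-laden example rather than the original program: the program name, the variable names and the shape B(NX,NY,NW) are hypothetical, and the sizes are the 200 x 200 x 224 values from the execution output. It validates the dimensions before use and checks the allocation status instead of letting a bad size propagate to the device.

```fortran
program checked_alloc
  implicit none
  integer :: nx, ny, nw, ierr
  real, allocatable :: b(:,:,:)

  ! Sizes are set explicitly here. In a real code they might come from a
  ! file header; forgetting to set one is the bug this check guards against.
  nx = 200
  ny = 200
  nw = 224

  ! Reject obviously bad sizes before they reach ALLOCATE.
  if (nx <= 0 .or. ny <= 0 .or. nw <= 0) then
    print *, 'invalid dimensions: ', nx, ny, nw
    stop 1
  end if

  ! STAT= returns a nonzero code on failure instead of aborting.
  allocate(b(nx, ny, nw), stat=ierr)
  if (ierr /= 0) then
    print *, 'host allocation failed, stat = ', ierr
    stop 1
  end if

  b = 0.0
  print *, 'allocated elements: ', size(b)
  deallocate(b)
end program checked_alloc
```

Built on the host with the "-g -Mbounds -Mchkptr" flags mentioned above, a check like this can surface a bad dimension before any accelerator region runs.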

- Mat