PGI User Forum
cuda memory issues
vibrantcascade



Joined: 04 Aug 2011
Posts: 28

Posted: Tue Jan 17, 2012 12:12 pm    Post subject: cuda memory issues

I'm running into the following error when trying to compile CUDA Fortran code and I'm looking for some advice.

[leiderml@ebwilson-mpi ~]$ pgfortran -Mcuda ibe-25Cuda.f
ptxas error : Entry function 'case8' uses too much local data (0xbdec bytes, 0x4000 max)
PGF90-F-0000-Internal compiler error. pgnvd job exited with nonzero status code 0 (ibe-25Cuda.f: 611)
PGF90/x86-64 Linux 11.5-0: compilation aborted

So basically it looks like I'm WAY over the memory usage. My module which I'm sending to the GPU has around 500 lines of code between all the functions. So I'm thinking the arrays are what is causing the problem. Although I think this GPU should have enough memory to handle this.
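For scale, the two hexadecimal figures in the ptxas message can be decoded with a few lines of standard Fortran. This is only a sketch of the arithmetic; the program and variable names are made up for illustration. The message works out to 48620 bytes of local data against a 16384-byte limit, so the limit in question is the kernel's local data as reported by ptxas, not the card's global memory.

```fortran
program decode_ptxas_sizes
  implicit none
  ! Values copied from the ptxas error message above
  integer, parameter :: used_bytes  = int(z'BDEC')
  integer, parameter :: limit_bytes = int(z'4000')
  ! Sizes of the two arrays declared in case8, at 8 bytes per double
  integer, parameter :: factrf_bytes = 8 * 501   ! factrf(0:500)
  integer, parameter :: fact_bytes   = 8 * 171   ! fact(0:170)

  print '(a,i0)',   'local data used : ', used_bytes
  print '(a,i0)',   'local data limit: ', limit_bytes
  print '(a,f5.2)', 'ratio           : ', real(used_bytes) / real(limit_bytes)
  print '(a,i0)',   'factrf + fact   : ', factrf_bytes + fact_bytes
end program decode_ptxas_sizes
```

The two factorial tables together come to 5376 bytes, so they do not account for the whole reported figure on their own.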

My GPU is:

Device Name: Tesla C2050
Device Revision Number: 2.0
Global Memory Size: 2817982464
Number of Multiprocessors: 14
Number of Cores: 448
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152

The declaration for case 8 is this: (and I'll have some explanation afterwards on how I'm using it)

      attributes(global) subroutine case8(m1n1SumAry,nsize,iStart,
     * factrf,fact,m,n,p,q,s,a,b,c,d,ip2,jp2,kp2,lp2)
      double precision, dimension(:) :: m1n1SumAry
      double precision, dimension(0:500) :: factrf
      double precision, dimension(0:170) :: fact
      integer, value :: nsize,m,n,p,q,s,ip2,
     * jp2,kp2,lp2,iStart
      integer m1,n1,p1,m1mn1,p1min
      double precision p1term,p1sum,n1term,n1sum,threej,a,b,c,d

Now factrf and fact are arrays of constant doubles which I'm sending to the GPU. I could compute them on the GPU, but they're still going to use the same space unless I tremendously slow down the code and recompute every factorial in that array every time it is needed.

m1n1SumAry is the variable size array that is whatever number of threads I pass in to return the values. Perhaps hard-coding the size would help with memory constraints?



So basically my questions come down to:
-Is 500 lines of code between all the functions in my module too much? Or how much can I really fit in a module between code and variables?

-Is the number of variables a problem? Or is it just the size of all the arrays combined with the size of the code? Because after the compiler strips out all the comments and shrinks everything down to machine code I'd think this would fit on the GPU no problem.

*edited, but nobody else has commented yet


Last edited by vibrantcascade on Tue Jan 17, 2012 1:27 pm; edited 1 time in total
vibrantcascade



Joined: 04 Aug 2011
Posts: 28

Posted: Tue Jan 17, 2012 12:21 pm    Post subject:

Ok I sort of figured out how to share arrays, but there aren't many good examples in the guide so here's what I'm trying:

      module cudaCase8
      contains
c     double precision, dimension(0:500) :: factrf
c     double precision, dimension(0:170) :: fact
c     attributes(shared) :: factrf, fact

      attributes(global) subroutine case8(m1n1SumAry,nsize,iStart,
     * factrf,fact,m,n,p,q,s,a,b,c,d,ip2,jp2,kp2,lp2)
      double precision, dimension(:) :: m1n1SumAry
      double precision, dimension(0:500) :: factrf
      double precision, dimension(0:170) :: fact
      attributes(shared) :: factrf, fact

If I declare the shared arrays at the module level (by uncommenting the three lines before the subroutine and commenting out the declarations inside it), I get a severe "Incorrect sequence of statements" error on the subroutine header where I pass the arrays in. If I instead declare them shared inside the subroutine (as above), the compiler warns that the SHARED attribute is ignored because they are dummy arguments.

PGF90-S-0070-Incorrect sequence of statements (ibe-25Cuda.f: 8)
0 inform, 0 warnings, 1 severes, 0 fatal for case8
[leiderml@ebwilson-mpi ~]$ pgfortran -Mcuda ibe-25Cuda.f
PGF90-S-0070-Incorrect sequence of statements (ibe-25Cuda.f: 7)
0 inform, 0 warnings, 1 severes, 0 fatal for case8

and in the second case the warning is:

PGF90-W-0526-SHARED attribute ignored on dummy argument factrf (ibe-25Cuda.f: 12)
PGF90-W-0526-SHARED attribute ignored on dummy argument fact (ibe-25Cuda.f: 12)

Could someone write or direct me towards a short bit of example code showing how to declare a shared array and populate it with an array passed into the GPU?
mkcolg



Joined: 30 Jun 2004
Posts: 6213
Location: The Portland Group Inc.

Posted: Tue Jan 17, 2012 2:05 pm    Post subject:

Hi vibrantcascade,


Quote:
Ok I just figured out how to share arrays so I'll be doing that. But I'm still wondering about overall memory usage and approximate code size constraints.
Sounds good. Another way to work around this limit is to move your local arrays to global memory (i.e. make them module variables) and then add an extra dimension or dimensions to the arrays to privatize them (i.e. each thread has its own column).

Secondly, you can reduce the number of threads in your thread block until they all fit.
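The first workaround above (module-level global memory with an extra dimension per thread) might look roughly like the sketch below. It assumes free-form source (for example a .cuf file); the names scratch_mod, work, work_len and fill_and_sum are hypothetical and not taken from the original program, and the scratch length of 64 is arbitrary.

```fortran
module scratch_mod
  use cudafor
  implicit none
  integer, parameter :: work_len = 64            ! hypothetical scratch length
  ! Global-memory scratch space: one column per thread
  double precision, device, allocatable :: work(:,:)
contains
  attributes(global) subroutine fill_and_sum(res, n)
    double precision :: res(*)
    integer, value :: n
    integer :: tid, k
    double precision :: total

    tid = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (tid > n) return

    ! Each thread touches only its own column, work(:, tid)
    do k = 1, work_len
      work(k, tid) = dble(k)
    end do
    total = 0.0d0
    do k = 1, work_len
      total = total + work(k, tid)
    end do
    res(tid) = total
  end subroutine fill_and_sum
end module scratch_mod

program main
  use cudafor
  use scratch_mod
  implicit none
  integer, parameter :: n = 512
  double precision :: res(n)
  double precision, device :: res_d(n)

  allocate(work(work_len, n))                    ! device allocation from the host
  call fill_and_sum<<<(n + 127) / 128, 128>>>(res_d, n)
  res = res_d                                    ! copy results back
  print *, res(1), res(n)
  deallocate(work)
end program main
```

The scratch array now lives in global memory, so it no longer counts toward the kernel's local data. Indexing it as work(tid, k) instead may give better memory coalescing, since neighbouring threads would then access neighbouring addresses; which layout is faster is worth measuring.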

Quote:
-Is 500 lines of code between all the functions in my module too much? Or how much can I really fit in a module between code and variables?
The number of lines of code doesn't matter from a resource constraint point of view (I've seen a 3600 line kernel before). What does matter is how much memory and registers each kernel uses. So if the 500 lines reuse the same variables, then this can be a big gain. Though, if you use a lot of variables, this will reduce the number of threads that can be used in a block.

Quote:
Are those constant arrays being fully copied every time I pass them to another function on the GPU?
No. Constant arrays would be passed by reference.

Quote:
If they are being copied, is there a way to make them global to the whole module like you do with common blocks in normal Fortran so I can save space? Or would I be better off recomputing every factorial I need on the fly?
Constant arrays (i.e. those declared with the constant attribute) can only be declared as a module variable or within host code and therefore already global. Did you mean local device arrays instead of constant?

- Mat
vibrantcascade



Joined: 04 Aug 2011
Posts: 28

Posted: Wed Jan 18, 2012 11:28 am    Post subject:

Hi Mat,

When I say constant arrays I mean the values are constant. However, I'm generating the values on the CPU before I send the array off to be used on the device. (So yes, I mean local device arrays. But I'm populating those arrays with an array from the program calling the device.)

When you say:
Quote:
Constant arrays (i.e. those declared with the constant attribute) can only be declared as a module variable or within host code and therefore already global. Did you mean local device arrays instead of constant?


The "within host code and therefore already global" part, are you saying the device can already see program level variables and arrays within the host code made global by something like the use of "common"?

I'm thinking that you mean something like what I have below in pseudocode, which is essentially what I'm trying to do. Only I'm now getting the error "PGF90-S-0520-Host MODULE data cannot be used in a DEVICE or GLOBAL subprogram - facts1" every time I try to use a module level variable like facts1.

In this case, would the array I'm passing into the device for facts1 in the Start subroutine automatically populate the facts1 module level array? (the dimensions are identical of course) Do I need to do something special to use that module level array in other functions being called into? Because that error message makes it sound like it's impossible.


module cudaModule
double precision, dimension(0:500) :: facts1
attributes(shared) :: facts1
contains

attributes(global) subroutine Start(SumAry,facts1,a,b,c,d)
double precision, dimension(:) :: SumAry
integer, value :: a,b,c,d

....code.....

end

double precision attributes(device) function func1(b)
integer b,i
do i = 0, 500
b = b + facts1(i) * i
enddo
func1 = b
return
end
end module cudaModule



Thanks for the help!
Morgan
mkcolg



Joined: 30 Jun 2004
Posts: 6213
Location: The Portland Group Inc.

Posted: Wed Jan 18, 2012 11:57 am    Post subject:

Hi Morgan,

In your first example, the "facts1" was local, hence every thread had their own copy of the array. Given the size, this wastes a lot of memory. In the second example, a "shared" array means that it is shared across all threads within the same thread block. It is not global.

In the final example, you are closer but need to change "shared" to "constant". A constant array is visible to all threads and is stored in a separate, fast-access memory. However, constant memory is read-only from the device, while it is read/write from the host. So slightly modifying your example:

Code:
module cudaModule
double precision, dimension(0:500) :: facts1
attributes(constant) :: facts1
contains

attributes(global) subroutine Start(SumAry,a,b,c,d)  ! no need to pass facts1
double precision, dimension(:) :: SumAry
integer, value :: a,b,c,d

....code.....

end

subroutine init_facts1()
   facts1 = 1.1  ! Init facts1 from the host
end subroutine init_facts1

double precision attributes(device) function func1(a)
integer a,i
double precision total
total = 0.0d0
do i = 0, 500
total = total + facts1(i) * a   ! facts1 is read-only
enddo
func1 = total
return
end function func1

end module cudaModule
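For completeness, here is a self-contained sketch of the same constant-memory pattern with a host driver added. It assumes free-form source (for example a .cuf file). The module name const_demo, the extra n argument, the launch configuration and the placeholder table values are all assumptions made for illustration and are not part of the original program.

```fortran
module const_demo
  use cudafor
  implicit none
  ! Module-level declarations go before CONTAINS
  double precision, constant :: facts1(0:500)
contains
  attributes(global) subroutine start(sumary, n, a)
    double precision :: sumary(*)
    integer, value :: n, a
    integer :: tid
    tid = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (tid <= n) sumary(tid) = func1(a)
  end subroutine start

  attributes(device) function func1(a) result(total)
    integer :: a
    double precision :: total
    integer :: i
    total = 0.0d0
    do i = 0, 500
      total = total + facts1(i) * a      ! read-only access on the device
    end do
  end function func1
end module const_demo

program main
  use cudafor
  use const_demo
  implicit none
  integer, parameter :: n = 256
  double precision :: table(0:500), sumary(n)
  double precision, device :: sumary_d(n)
  integer :: i

  do i = 0, 500
    table(i) = dble(i)                   ! placeholder values, not real factorials
  end do
  facts1 = table                         ! host writes the constant array

  call start<<<(n + 63) / 64, 64>>>(sumary_d, n, 2)
  sumary = sumary_d                      ! copy results back to the host
  print *, sumary(1)
end program main
```

The table is filled once on the host and assigned to the constant array before the launch, so nothing has to be passed to the kernel or copied per thread.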


Quote:
Could someone write or direct me towards a short bit of example code showing how to declare a shared array and populate it with an array passed into the GPU?
Take a look at the "sgemm.cuf" example file that ships with the compilers. It will be located in the "etc/samples" directory under your PGI compiler installation tree. (For example on my system it's in /opt/pgi/linux86-64/11.10/etc/samples).
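As a smaller illustration than sgemm.cuf, the sketch below shows one common way to declare a shared array inside a kernel and fill it from a device array passed in as an argument. It is not taken from the PGI samples; the names shared_demo, lookup, table and tab_s are hypothetical, and it assumes free-form source (for example a .cuf file). The shared array is a local variable of the kernel, not a dummy argument and not a module variable.

```fortran
module shared_demo
  use cudafor
  implicit none
contains
  attributes(global) subroutine lookup(res, table, n)
    double precision :: res(*)                   ! device array, one result per thread
    double precision :: table(0:500)             ! device array in global memory
    integer, value :: n
    double precision, shared :: tab_s(0:500)     ! one copy per thread block
    integer :: i, tid

    ! Threads of the block cooperatively copy global -> shared
    do i = threadIdx%x - 1, 500, blockDim%x
      tab_s(i) = table(i)
    end do
    call syncthreads()                           ! wait until the copy is complete

    tid = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (tid <= n) res(tid) = tab_s(mod(tid, 501))
  end subroutine lookup
end module shared_demo

program main
  use cudafor
  use shared_demo
  implicit none
  integer, parameter :: n = 1024
  double precision :: table(0:500), res(n)
  double precision, device :: table_d(0:500), res_d(n)
  integer :: i

  do i = 0, 500
    table(i) = dble(i)                           ! placeholder values
  end do
  table_d = table                                ! host -> device copy
  call lookup<<<(n + 255) / 256, 256>>>(res_d, table_d, n)
  res = res_d                                    ! device -> host copy
  print *, res(1), res(n)
end program main
```

Every block repeats the copy into its own shared array, so for a read-only table that fits in constant memory the constant version is usually the simpler choice.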

Hope this helps,
Mat
All times are GMT - 7 Hours. Page 1 of 3.