PGI User Forum
large input value sets

 
vibrantcascade



Joined: 04 Aug 2011
Posts: 28

Posted: Wed Feb 06, 2013 7:44 pm    Post subject: large input value sets

I'm running CUDA Fortran, and in the worst case my code needs to generate 6 integers as inputs for every CUDA thread.

Currently I have 6 arrays of 2048 integers each. Before each call to the GPU, I call a subroutine in the module containing the GPU kernel to copy the input arrays into 6 constant arrays of 2048 integers each. (I believe this loads them into high-speed read-only texture memory, if I remember correctly.) I then call the GPU with a 2048-element array of doubles to get the results. Then I generate the next set of 2048 input values and repeat.

The GPU takes only about 2 to 8 seconds to complete the 2048 threads, so it is constantly doing I/O and wasting a lot of time. These calculations run for weeks overall, so I'd like to pass, say, 10,000+ threads at a time to get better performance. However, 6 arrays of 2048 integers each appear to use up all of the 48 KB or so of read-only memory, and I get insufficient-memory errors if I increase the size of the 6 input arrays much past this.

So is there a way to use the 3+ GB of main video card memory to load more values, or to stream in groups of new values as the old ones finish? That would save the 200 ms or so of I/O time I'm wasting every few seconds and let the GPU churn away longer.

I have access to both a Fermi C2050 GPU and a GTX 680 (GK104), if it matters.

I assume that in July, when I get my hands on a GK110, whose kernels can call other kernels, I'll be able to fix this by simply turning the main GPU calling loop into another kernel. But I'm wondering whether I can do anything on older GPUs.

Thanks!
Morgan
mkcolg



Joined: 30 Jun 2004
Posts: 6208
Location: The Portland Group Inc.

Posted: Thu Feb 07, 2013 9:17 am

Quote:
(I believe this loads them into high speed read only texture memory if I remember correctly.)
Close, it's actually put in constant memory, not texture.

Quote:
So is there a way to use the 3+ gigs of main videocard memory to load up more values or stream in groups of new values when the old values finish to save myself the 200ms or so of IO time I'm wasting every few seconds and let the GPU churn away longer?
Sure, there's nothing in CUDA Fortran that limits you in terms of memory size. You may need to add the flag "-Mlarge_arrays" if an individual array is over 2GB. The limiting factor will be your card's memory. To see the limits of your cards, use the utility "pgaccelinfo".

Note that while the constant memory size can vary between devices, it's typically only 64 KB. So you'd need to move your constant arrays over to global memory.

- Mat
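
As a rough check of the sizes discussed in this thread, here is the arithmetic for the six input arrays. It assumes the default 4-byte integer (an assumption; the thread does not state the integer kind) and the typical 64 KB constant memory mentioned above.

```latex
\begin{align*}
6 \times 2048 \times 4\ \text{bytes} &= 49{,}152\ \text{bytes} = 48\ \text{KB} \;<\; 64\ \text{KB} \\
6 \times 10{,}000 \times 4\ \text{bytes} &= 240{,}000\ \text{bytes} \approx 234\ \text{KB} \;>\; 64\ \text{KB}
\end{align*}
```

Under those assumptions, a 10,000-thread batch cannot fit in constant memory, while it is tiny compared with the 3+ GB of global memory on the card.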
vibrantcascade



Joined: 04 Aug 2011
Posts: 28

Posted: Fri Feb 08, 2013 9:57 am

When I remove the "attributes(constant)" statements, I start getting this message:

PGF90-S-0520-Host MODULE data cannot be used in a DEVICE or GLOBAL subprogram - m1aryd (i4six6oddCuda.f: 31)

If I'm not using constant data, can I only pass data to the kernel as arguments to the main global subroutine I call to spawn the threads on the GPU, where they arrive as local arrays? I assumed that once I removed the constant attribute, the array data would simply be loaded into the card's larger 3 GB of memory, instead of the 64 KB of constant memory, when I made the call and handed the kernel off to the GPU.

I was doing something like this before when using constants:

      module i4six6oddcuda
c     making variables local to module
      double precision, dimension(0:500) :: factrfD
      double precision, dimension(0:170) :: factD
      integer, dimension(1:2048) :: m1AryD
      integer, dimension(1:2048) :: n1AryD
      integer, dimension(1:2048) :: p1AryD
      integer, dimension(1:2048) :: q1AryD
      integer, dimension(1:2048) :: s1AryD
      integer, dimension(1:2048) :: t1AryD
      attributes(constant) :: factrfD,factD,m1AryD,n1AryD,p1AryD
      attributes(constant) :: q1AryD,s1AryD,t1AryD
      contains

c     host routine: copies the host inputs into the constant arrays
      subroutine setMNPQSTarrays(m1Ary,n1Ary,p1Ary,q1Ary,s1Ary,t1Ary)
        integer, dimension(1:2048) :: m1Ary
        integer, dimension(1:2048) :: n1Ary
        integer, dimension(1:2048) :: p1Ary
        integer, dimension(1:2048) :: q1Ary
        integer, dimension(1:2048) :: s1Ary
        integer, dimension(1:2048) :: t1Ary
        m1AryD = m1Ary
        n1AryD = n1Ary
        p1AryD = p1Ary
        q1AryD = q1Ary
        s1AryD = s1Ary
        t1AryD = t1Ary
      end subroutine setMNPQSTarrays
c     (rest of the module not shown in the post)
mkcolg



Joined: 30 Jun 2004
Posts: 6208
Location: The Portland Group Inc.

Posted: Fri Feb 08, 2013 10:32 am

You probably forgot to add the "device" attribute. Without it, the module data is a host side variable and can't be used on the device.

- Mat
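
For readers following along, here is a minimal sketch of what that change might look like. It is not the poster's actual code. It assumes free-form CUDA Fortran source (for example a .cuf file) rather than the fixed-form file in the post. The batch size nBatch, the stand-in kernel sumInputs, and the demo driver program are all hypothetical. Keeping the two small factorial tables in constant memory is a choice made for this sketch.

```fortran
! Sketch only: free-form CUDA Fortran (for example a .cuf file).
module i4six6oddcuda
  use cudafor
  implicit none
  ! Hypothetical batch size; not taken from the original post.
  integer, parameter :: nBatch = 10240
  ! The two small lookup tables still fit in constant memory.
  double precision, constant :: factrfD(0:500)
  double precision, constant :: factD(0:170)
  ! "device" places the per-thread inputs in global memory on the card.
  integer, device :: m1AryD(nBatch), n1AryD(nBatch), p1AryD(nBatch)
  integer, device :: q1AryD(nBatch), s1AryD(nBatch), t1AryD(nBatch)
contains

  ! Host routine: each assignment below is a host-to-device copy.
  subroutine setMNPQSTarrays(m1Ary, n1Ary, p1Ary, q1Ary, s1Ary, t1Ary)
    integer, intent(in) :: m1Ary(nBatch), n1Ary(nBatch), p1Ary(nBatch)
    integer, intent(in) :: q1Ary(nBatch), s1Ary(nBatch), t1Ary(nBatch)
    m1AryD = m1Ary
    n1AryD = n1Ary
    p1AryD = p1Ary
    q1AryD = q1Ary
    s1AryD = s1Ary
    t1AryD = t1Ary
  end subroutine setMNPQSTarrays

  ! Hypothetical stand-in kernel: one thread per set of six inputs.
  attributes(global) subroutine sumInputs(res)
    double precision :: res(nBatch)
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= nBatch) then
      res(i) = dble(m1AryD(i) + n1AryD(i) + p1AryD(i) &
                    + q1AryD(i) + s1AryD(i) + t1AryD(i))
    end if
  end subroutine sumInputs
end module i4six6oddcuda

! Hypothetical driver showing one batch.
program demo
  use cudafor
  use i4six6oddcuda
  implicit none
  integer :: m(nBatch), n(nBatch), p(nBatch)
  integer :: q(nBatch), s(nBatch), t(nBatch)
  double precision :: res(nBatch)
  double precision, device :: resD(nBatch)

  m = 1; n = 2; p = 3; q = 4; s = 5; t = 6
  call setMNPQSTarrays(m, n, p, q, s, t)
  call sumInputs<<<(nBatch + 255) / 256, 256>>>(resD)
  res = resD   ! device-to-host copy; waits for the kernel to finish
  print *, res(1), res(nBatch)
end program demo
```

With the inputs in global memory, the batch size is limited by the card's memory rather than by the constant memory limit, so nBatch can be raised well past 2048. The limits of a given card can be checked with pgaccelinfo, as noted earlier in the thread.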