PGI User Forum
cuda memory issues
vibrantcascade



Joined: 04 Aug 2011
Posts: 28

Posted: Wed Jan 18, 2012 2:10 pm

Thanks Mike! I see the memory usage shrinking already. (still a ways to go though)

If I wanted a and b to be shared because they'll always be the same within the threadblock would I essentially do this then?

Although now that I think about it, there's no way this would work at the block level as the device wouldn't know what block to assign the values to when called from the host. Do I just pass them in with the initial call and then I don't have to declare the shared a and b inside of subroutine Start?

Quote:

module cudaModule
double precision, dimension(0:500) :: facts1
attributes(constant) :: facts1
integer, value :: a,b
attributes(shared) :: a,b
contains

attributes(global) subroutine Start(SumAry,c,d) ! no need to pass facts1, a, b
double precision, dimension(:) :: SumAry
integer, value :: c,d

....code.....

end

subroutine init_facts1()
double precision, dimension(0:500) :: factsHost
common /facts/ factsHost
facts1 = factsHost ! Init facts1 from the host
end subroutine init_facts1

subroutine init_ints(aHost,bHost)
integer aHost,bHost
a = aHost
b = bHost
end subroutine init_ints

double precision attributes(device) function func1(a,b)
integer a,b,i
do i = 0, 500
b = facts1(i) * a ! facts1 is read-only
enddo
func1 = b
return
end

end module cudaModule


Thanks!
Morgan
vibrantcascade



Joined: 04 Aug 2011
Posts: 28

Posted: Wed Jan 18, 2012 2:22 pm

So this rather than the last example?

Quote:

module cudaModule
double precision, dimension(0:500) :: facts1
attributes(constant) :: facts1
integer, value :: a,b
attributes(shared) :: a,b
contains

attributes(global) subroutine Start(SumAry,a,b,c,d) ! no need to pass facts1
double precision, dimension(:) :: SumAry
integer, value :: c,d

....code.....

end

subroutine init_facts1()
double precision, dimension(0:500) :: factsHost
common /facts/ factsHost
facts1 = factsHost ! Init facts1 from the host
end subroutine init_facts1

double precision attributes(device) function func1()
integer a,b,i
do i = 0, 500
b = facts1(i) * a ! facts1 is read-only
enddo
func1 = b
return
end

end module cudaModule
mkcolg



Joined: 30 Jun 2004
Posts: 6218
Location: The Portland Group Inc.

Posted: Wed Jan 18, 2012 3:00 pm

Hi Morgan,

The shared attribute can only be applied to local variables. Module variables can only be "device" (i.e. stored in device memory, accessible by all threads, readable and writable) or "constant" (i.e. stored in constant memory, accessible by all threads, read-only on the device).
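
For illustration only, here is a minimal sketch of a shared variable declared locally inside a kernel, as opposed to at module scope. The names shared_sketch, scale_by_block and blockFactor are hypothetical and not from Morgan's code; the sketch assumes a PGI CUDA Fortran compiler and a single GPU.

```fortran
module shared_sketch
  use cudafor
  implicit none
contains
  ! Hypothetical kernel: one shared scalar per thread block.
  attributes(global) subroutine scale_by_block(out, n, factor)
    integer, value :: n
    double precision, value :: factor
    double precision :: out(n)
    ! Declared inside the kernel, so the shared attribute is allowed here.
    double precision, shared :: blockFactor
    integer :: i

    ! One thread per block writes the shared value...
    if (threadIdx%x == 1) blockFactor = factor * blockIdx%x
    ! ...and the barrier makes it visible to the rest of the block.
    call syncthreads()

    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) out(i) = blockFactor
  end subroutine scale_by_block
end module shared_sketch

program main
  use cudafor
  use shared_sketch
  implicit none
  integer, parameter :: n = 64
  double precision :: host(n)
  double precision, device :: dev(n)

  call scale_by_block<<<2, 32>>>(dev, n, 1.5d0)
  host = dev                 ! device-to-host copy
  print *, host(1), host(n)
end program main
```

The write by a single thread followed by syncthreads() is the kind of guarding a shared variable needs if its value is set or changed inside the kernel.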

Quote:
If I wanted a and b to be shared because they'll always be the same within the threadblock would I essentially do this then?
I guess I'm not clear what you're trying to do here or why you want a and b to be shared.

Since the value of b does change, having it shared could cause coherency problems if it's not properly guarded. Having it as a device variable would be even worse. It seems to me that passing b by value is your best option.

Since a doesn't change, putting it in constant memory would be best.

Can you post a more complete example of what you're trying to do? That might help me give you better advice.

- Mat
vibrantcascade



Joined: 04 Aug 2011
Posts: 28

Posted: Thu Jan 19, 2012 1:00 pm

Hi Mat,

Yeah, I suppose my example is bad in that case. My actual code is around 5,000 lines of host code, which I've updated from Fortran 77 standards to get it working in OpenMP while teaching myself Fortran along the way. (Unfortunately they didn't teach Fortran anymore when I was getting my comp sci degree.)

Now I'm converting the slowest section to CUDA as a test. In my actual code I have 10 variables which are basically constant for all of the threads in a block. There are about 5 device functions in the device code which use those variables, so if I pass them explicitly it's eating up (10 variables x 5 functions) or 50 variables' worth of space per thread, from what I understand. Now multiply that by either the 32 threads I'm using in a thread block, or the 448 threads I'm creating in total, and you suddenly have thousands of copies of the same variables, which don't change between the threads in a block. That is probably a good portion of my memory problems.

(Perhaps I'm a little off on how this works, but from what I've read and what you said, it appears that memory is set aside for every variable passed to every device function, per thread in a thread block. So if I keep my thread blocks down to the minimum size of 32 and just make more blocks, it should help. That would make sense, as that's how most compilers operate.)

If I made the host code that can call into the GPU multi-threaded so I had 2 different host threads spawning thread blocks, I would get thread blocks with different values for those 10 variables. My arrays that I asked about earlier would be constant for all cases even between thread blocks, but those 10 variables wouldn't be.

So I guess it comes down to this.

- If I'm running 2 threads in my host code (OpenMP) and both are launching CUDA thread blocks at the same time, every thread within a given block has the same values for those 10 variables, but the blocks launched by the 2nd host thread use different values from those launched by the 1st. If I make those variables constant, will I have a problem? Or can the GPU tell that the constant variables came from a different host thread and handle that seamlessly?

It was appearing that because shared variables are shared within a thread block that I might want to use those in this case to reduce memory overhead as I'm still way over.

Thanks!
Morgan

(And sorry for the long posts. There's a lot that the couple of examples you guys have don't cover, and things are kind of vague in the programmer's guide at times.)
vibrantcascade



Joined: 04 Aug 2011
Posts: 28

Posted: Thu Jan 19, 2012 2:34 pm

Here's basically what I have now in pseudocode, only tremendously simplified with a few layers removed.

Quote:

module cudaModule
double precision, dimension(0:500) :: facts1
attributes(constant) :: facts1

contains

attributes(global) subroutine Start(SumAry,a,b,c,d,e,f,g,h,i,j) ! no need to pass facts1
double precision, dimension(:) :: SumAry
double precision, value :: a,b,c,d,e,f,g,h,i,j,i1

i1 = (blockIdx%x-1)*blockDim%x + threadIdx%x
SumAry(i1)=func1(a,b,c,d,e,f,g,h,i,j,other vars)

end

subroutine init_facts1()
double precision, dimension(0:500) :: factsHost
common /facts/ factsHost
facts1 = factsHost ! Init facts1 from the host
end subroutine init_facts1

double precision attributes(device) function func1(,a,b,c,d,e,f,g,h,i,j)
double precision tot,a,b,c,d,e,f,g,h,i,j
do i = 0, 500
tot=func2(a,b,c,d,e,f,g,h,i,j,other vars)
enddo
...calculations....
func1 = tot
return
end

double precision attributes(device) function func2(,a,b,c,d,e,f,g,h,i,j)
double precision tot,a,b,c,d,e,f,g,h,i,j,tot1,tot2,tot3,....
...code which uses facts1.....
tot1=func3(a,b,c,d,e,f,g,h,i,j)
tot2=func3(a,b,c,d,g,h,i,j,f,e)
tot3=func3(a,b,c,d,h,i,e,j,f,g)
..more permutations...
func2 = tot1 + tot2 + tot3 + ...
end

double precision attributes(device) function func3(a,b,c,d,e,f,g,h,i,j)
double precision a,b,c,d,e,f,g,h,i,j
....calculations and variables..... uses facts1
end

end module cudaModule


So in this example the values of a,b,c,d,e,f,g,h,i,j are constant in Start, func1, and func2, while in func3 they have to be passed by value because of the different permutations. My actual code currently has more layers of passing. What I want is to avoid the memory overhead of setting aside storage for every variable in Start, func1, and func2, multiplied by the number of threads. Perhaps I should make them constant and then call a module subroutine from the host to set them all before I launch the thread blocks?
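
A minimal sketch of that last idea, assuming the ten values can be packed into one module-level constant array. The names const_sketch, params and init_params are hypothetical, and the body of func1 is only a placeholder sum, not the real calculation.

```fortran
module const_sketch
  use cudafor
  implicit none
  ! Hypothetical stand-in for a..j: one copy in constant memory,
  ! read-only on the device, so it is not passed as an argument.
  double precision, constant :: params(10)
contains
  ! Host routine: call this before launching the kernel.
  subroutine init_params(hostParams)
    double precision, intent(in) :: hostParams(10)
    params = hostParams        ! host-to-constant-memory copy
  end subroutine init_params

  ! Device function reads params directly instead of taking 10 arguments.
  attributes(device) function func1() result(tot)
    double precision :: tot
    integer :: k
    tot = 0.0d0
    do k = 1, 10
      tot = tot + params(k)    ! placeholder for the real calculation
    end do
  end function func1

  attributes(global) subroutine Start(SumAry, n)
    integer, value :: n
    double precision :: SumAry(n)
    integer :: i1
    i1 = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i1 <= n) SumAry(i1) = func1()
  end subroutine Start
end module const_sketch
```

On the host this would be used as call init_params(hostParams) followed by call Start<<<grid, block>>>(SumAry_d, n), where SumAry_d is a device array of length n. The permuted calls into func3 would still need explicit arguments, since the argument order differs from call to call.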
All times are GMT - 7 Hours
Page 2 of 3

 