PGI User Forum
 SearchSearch   MemberlistMemberlist     RegisterRegister   ProfileProfile    Log inLog in 

Free OpenACC Webinar

cudaHostRegister and Fortran

 
Post new topic   Reply to topic    PGI User Forum Forum Index -> Programming and Compiling
View previous topic :: View next topic  
Author Message
TheMatt



Joined: 06 Jul 2009
Posts: 322
Location: Greenbelt, MD

PostPosted: Fri Jul 22, 2011 6:08 am    Post subject: cudaHostRegister and Fortran Reply with quote

In the code I currently work with, many of the host arrays that are copied to my device kernels are allocated far "above" the actual computation routines. As you traverse the tree down from, for example, physics to moist processes, by then an array for temperature is actually a pointer to that space allocated far above.

This setup has often stymied my ability to try async/double buffering, use faster memory transfers, &c. since I'd need pinned memory and it's a bit difficult to pinpoint, exactly, who first allocated that array to pin it there. (I could, of course, allocate new pinned memory buffers inside my local routines and work with them, but that would be doubling the host memory needed for quite a few, very large arrays. Plus, allocating pinned memory is slow, so doing it every timestep is not good.)

So, I was intrigued when I learned of cudaHostRegister in CUDA 4.0. It seems like it would help in that I could register, say, my local temperature pointer and then gain the chance for faster transfers, the possibility of async. copies, et al.

But that leads to my question: reading the CUDA Fortran User's Guide, am I right in thinking that I can't HostRegister a Fortran array/pointer? Rather I'd need to use iso_c_binding and have fun with posix_memalign, c_ptr, c_f_pointer, &c.? (In which case, I'd be valloc'ing a new array and doubling space again...)

Thanks for any help with this and other questions sure to surface as I explore all the new 4.0 routines.

Matt
Back to top
View user's profile
brentl



Joined: 20 Jul 2004
Posts: 132

PostPosted: Fri Jul 22, 2011 8:33 am    Post subject: Reply with quote

The cudaHostRegister implementation in CUDA 4.0 is a little difficult for us to work with in CUDA Fortran. Currently, it requires the buffer you would like to pin to be aligned on a 4K boundary, and of a size that is a multiple of 4K.

In general, the address of your buffer is not that accessible in Fortran, so this is a difficult concept for the language.

We've heard of some plans from NVIDIA to alleviate some of these issues, probably in CUDA 4.1, but no promises there.

PGI is working on an extension to CUDA Fortran, an "ALIGN" qualifier on the allocate statement, that will, as you point out, sit atop the underlying posix_memalign() or other platform-specific routine to return an aligned buffer area.

Again, no promises when that will be ready, but hopefully in the next few releases. If you allocate an area aligned on a 4K boundary, then that can be padded to the right size, to be acceptable by the current cudaHostRegister implementation. Then it is just up to you as a programmer to properly allocate the arrays that you may possibly want pinned later on.

Hope that helps.

- Brent
Back to top
View user's profile
LeviBarnes



Joined: 03 Oct 2012
Posts: 10

PostPosted: Fri Feb 08, 2013 2:00 pm    Post subject: Update? Reply with quote

Has the status of cudaHostRegister changed? Can I page-lock a chunk of host memory without iso_c_binding?

Even with iso_c_binding, I'm having no luck:

Code:
use iso_c_binding
double precision, target :: junk(1024)
double precision, device, allocatable :: d_junk(:)
integer :: istat

allocate(d_junk(1024))
istat = cudaHostRegister(c_loc(junk), 1024, 0)
istat = cudaMemcpyAsync(d_junk, junk, 1024)
deallocate(d_junk)


gives a runtime failure
Code:
0: copyin Memcpy (dev=0x500200000, host=0x68f040, size=8192) FAILED: 11(invalid argument)


without the call to cudaHostRegister, all is well. Can anyone see why?
Back to top
View user's profile
brentl



Joined: 20 Jul 2004
Posts: 132

PostPosted: Fri Feb 08, 2013 2:53 pm    Post subject: Reply with quote

Two possible issues here:

The current implementation of cudaHostRegister has not been overloaded to take all datatypes like many of the other API routines, so you need to pass it c_loc() of the array like you did. But, then, the count is in bytes, so you need to pass 1024*8.

That should probably do the trick. Then, if the Async memcpy is actually happening asychnronously, you might be running into trouble deallocating the array before the transfer happens! (Maybe the CUDA runtime synchronizes those two operations, I'm not sure...)

If you use the cudafor module, it includes iso_c_binding, so you don't need to use that explicitly.
Back to top
View user's profile
LeviBarnes



Joined: 03 Oct 2012
Posts: 10

PostPosted: Fri Feb 08, 2013 4:34 pm    Post subject: Thanks Reply with quote

That did it. Thanks.
Back to top
View user's profile
Display posts from previous:   
Post new topic   Reply to topic    PGI User Forum Forum Index -> Programming and Compiling All times are GMT - 7 Hours
Page 1 of 1

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © phpBB Group