PGI User Forum
CUDA Fortran and CUDA API 3D Arrays

 
TheMatt



Joined: 06 Jul 2009
Posts: 304
Location: Greenbelt, MD

Posted: Mon Jun 07, 2010 11:01 am    Post subject: CUDA Fortran and CUDA API 3D Arrays

In my quest to try out asynchronous memcpy with CUDA Fortran, I've figured out how to use cudaMalloc/Memcpy and cudaMallocPitch/Memcpy2D. Hooray!

But now I move on to the 3D API calls, and I'm wondering if anyone has any advice with these. Looking at the cuda headers, etc., with PGI, I can see that cudaExtent looks to be a simple TYPE, but I'm wondering if anything special needs to be done for cudaPitchedPtr or cudaMemcpy3DParms? And is there a subroutine similar to make_cudaPitchedPtr?

(This is all to say nothing about the *use* of these in the kernel!)
mkcolg



Joined: 30 Jun 2004
Posts: 5815
Location: The Portland Group Inc.

Posted: Tue Jun 08, 2010 10:34 am

Hi Matt,

I'm out of the office so I can't test it myself, but I've seen internal mail stating that cudaMalloc3D should be added in 10.6. Once I'm back next week, I can get you further details if you need them.

Thanks,
Mat
Tuan



Joined: 11 Jun 2009
Posts: 233

Posted: Sat Nov 20, 2010 8:15 pm    Post subject: Re: CUDA Fortran and CUDA API 3D Arrays

TheMatt wrote:
In my quest to try out asynchronous memcpy with CUDA Fortran, I've figured out how to use cudaMalloc/Memcpy and cudaMallocPitch/Memcpy2D. Hooray!

But now I move on to the 3D API calls, and I'm wondering if anyone has any advice with these. Looking at the cuda headers, etc., with PGI, I can see that cudaExtent looks to be a simple TYPE, but I'm wondering if anything special needs to be done for cudaPitchedPtr or cudaMemcpy3DParms? And is there a subroutine similar to make_cudaPitchedPtr?

(This is all to say nothing about the *use* of these in the kernel!)


Hi TheMatt,
Could you please share your experience using cudaMallocPitch() (with an example) and cudaMalloc3D(), if possible?

Thanks,
Tuan
TheMatt



Joined: 06 Jul 2009
Posts: 304
Location: Greenbelt, MD

Posted: Mon Nov 22, 2010 5:51 am    Post subject: Re: CUDA Fortran and CUDA API 3D Arrays

Tuan wrote:
Hi TheMatt,
Could you please share your experience using cudaMallocPitch() (with an example) and cudaMalloc3D(), if possible?

Thanks,
Tuan

Tuan,

I suppose so. What I did was pretty simple, though in the end it turned out not to be worth it (my case has fairly simple memory access, so padding doesn't help much).

What I did in my test was like this. First in the driver:
Code:
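use cudafor   ! CUDA Fortran runtime API module (cudaMallocPitch, cudaMemcpy2D)

integer :: m = 1782
integer :: np = 72

! Pitches (padded leading dimensions) returned by cudaMallocPitch
integer :: mnp_pitch, mnp1_pitch
integer :: istat

! Inputs
real, allocatable :: ta(:,:)
real, allocatable, device :: ta_dev(:,:)

! Outputs
real, allocatable :: flx(:,:)
real, allocatable, device :: flx_dev(:,:)

allocate(ta(m,np))
allocate(flx(m,np+1))

! Allocate pitched device arrays; the pitch comes back in the second argument
istat = cudaMallocPitch(ta_dev,mnp_pitch,m,np)
istat = cudaMallocPitch(flx_dev,mnp1_pitch,m,np+1)

! (...initialize ta by reading in from file, say...)

! Host -> device: (dst, dst pitch, src, src pitch, width, height)
istat = cudaMemcpy2D(ta_dev,mnp_pitch,ta,m,m,np)

call kernel<<<Grid,Block>>>(m,np,mnp_pitch,mnp1_pitch,ta_dev,...,flx_dev,...)

! Device -> host: on the host side the pitch is simply m
istat = cudaMemcpy2D(flx,m,flx_dev,mnp1_pitch,m,np+1)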


Here I use cudaMallocPitch to allocate the pitched memory and to get the pitch itself, which is returned in the second argument. As you can see, I was being a bit too careful by keeping a separate pitch for the m-by-np and m-by-(np+1) arrays. I'm pretty sure this is overkill (both arrays have the same leading dimension m, so the two pitches should match), but I wanted to make sure I didn't make a mistake.
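To make the idea of a pitch concrete, here is a minimal host-only sketch in standard Fortran (no GPU and no CUDA calls). It assumes, purely for illustration, that the padded leading dimension is m rounded up to a multiple of 32 elements; the real value is whatever cudaMallocPitch returns and may differ by device and compiler version. The names align and offset are hypothetical.

```fortran
program pitch_sketch
  implicit none
  integer, parameter :: m = 1782, np = 72
  integer, parameter :: align = 32   ! ASSUMPTION: illustrative alignment, not a documented value
  integer :: pitch, i, j, offset

  ! Round the leading dimension up to the next multiple of align
  pitch = ((m + align - 1) / align) * align
  print *, 'pitch              =', pitch
  print *, 'padding per column =', pitch - m

  ! Zero-based element offset of (i,j) in pitched, column-major storage
  i = 1
  j = 2
  offset = (j - 1) * pitch + (i - 1)
  print *, 'offset of (1,2)    =', offset

  ! Storage reserved versus storage actually holding data
  print *, 'elements allocated =', pitch * np
  print *, 'elements used      =', m * np
end program pitch_sketch
```

Under that assumption the pitch comes out as 1792, so each column carries 10 padding elements and column 2 starts 1792 elements in rather than 1782.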

The issue that got me was the argument order of cudaMemcpy2D. You have to remember to get the src and dst pitches correct. That is, the first two arguments are the destination array and its pitch, followed by the source array and its pitch, and then the actual width and height of the data being copied. Say mnp_pitch is 1792 (which is divisible by 32, unlike the actual m = 1782; I'm not sure that's the value it will actually use). Then you don't want to use:
Code:
istat = cudaMemcpy2D(ta_dev,mnp_pitch,ta,m,mnp_pitch,np)

because although you might have allocated (mnp_pitch,np) on the device, there are still only (m,np) elements of data; the pitch just tells it how far to skip from one column to the next. (And, of course, on the host the pitch is m.)
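The argument order can be illustrated with a small host-only stand-in for the copy, written in standard Fortran. The subroutine copy2d below is hypothetical and only mimics the (dst, dst pitch, src, src pitch, width, height) ordering described above; it is not the cudafor implementation, and the sizes are shrunk (m = 6, np = 3, pitch = 8) to keep things readable.

```fortran
program pitched_copy_sketch
  implicit none
  integer, parameter :: m = 6, np = 3   ! small stand-ins for m = 1782, np = 72
  integer, parameter :: pitch = 8       ! ASSUMPTION: illustrative padded leading dimension
  real :: ta(m, np)                     ! host array, leading dimension m
  real :: ta_pad(pitch, np)             ! host stand-in for the pitched device array
  integer :: i, j

  do j = 1, np
    do i = 1, m
      ta(i, j) = real(10*j + i)
    end do
  end do
  ta_pad = -1.0                         ! sentinel so padding rows are recognizable

  ! Same ordering as the correct call: dst, dst pitch, src, src pitch, width, height
  call copy2d(ta_pad, pitch, ta, m, m, np)

  print *, 'data copied:      ', all(ta_pad(1:m, :) == ta)
  print *, 'padding untouched:', all(ta_pad(m+1:pitch, :) == -1.0)

contains

  ! Hypothetical helper: copies width elements from each of height columns
  subroutine copy2d(dst, dpitch, src, spitch, width, height)
    integer, intent(in) :: dpitch, spitch, width, height
    real, intent(inout) :: dst(dpitch, height)
    real, intent(in)    :: src(spitch, height)
    integer :: col
    do col = 1, height
      dst(1:width, col) = src(1:width, col)
    end do
  end subroutine copy2d

end program pitched_copy_sketch
```

Both checks should report T: the first m rows of each column hold the data and the padding rows keep their sentinel value. Passing pitch as the width instead would ask for 8 elements from each 6-element host column, which is the mistake being warned about.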

Let me know if this does or does not make sense. As I said, I don't use this currently in my "production" work, but I'm sure soon enough I will, so if I made a mistake, it'd be good to know!

Matt
JDS7



Joined: 19 Aug 2011
Posts: 17

Posted: Wed Oct 05, 2011 1:59 pm

Hi Matt,

I'm trying to use cudamallocpitch but pgfortran gives me this error:

"PGF90-S-0155-Could not resolve generic procedure cudamallocpitch"

This may be a dumb question, but how did you get your fortran code to compile with the call to cudamallocpitch? Mine looks just like yours:

r = cudaMallocPitch(udev, pitch, nx, ny*nz)

(where udev is an array of floats, and r, pitch, nx, ny, and nz are integers)

Thanks

Jim
All times are GMT - 7 Hours