|
| View previous topic :: View next topic |
| Author |
Message |
michel.mueller
Joined: 30 Jan 2013 Posts: 2
|
Posted: Mon Mar 04, 2013 4:19 am Post subject: Slicing a device array in CUDA Fortran |
|
|
Hello there. I'm in the process of porting a relatively large project to CUDA Fortran and I've ran into an issue with one of the modules: Even after lots of optimization, it creates too much temporary device memory per thread to be able to run in a grid with more than, say, 64x32 threads. Now this is ok, as long as we can run the rest of the program using more threads, for example 256x256. In order to do this we would like to execute that one module in strides of 64x32 threads serially. Here comes the problem:
I can't figure out, how to slice device arrays (both intent(in) and intent(out)) in order to pass only a stride of the input/output arrays to the module subroutines. We don't want to copy to host and back for this, since that would impact performance too much. Here's what I've tried:
1) using standard Fortran array slicing notation, such as
| Code: | | call my_module_kernel_wrapper(myInput(strideBegin:strideEnd), myOutput(strideBegin:strideEnd)) |
outcome: "Profiled program has returned error code 139". Note: I haven't yet tried a minimal example as shown above, but I can't find any information about host error codes. I haven't run it using a profiler, this is just the message when executing it normally. Does anyone know whether the above notation is supported for device arrays?
2) using temporary device arrays (in host code), such as
| Code: |
real(8), dimension(stride), device :: myTemp
... ! index calculation strideBegin, strideEnd
myTemp = myInput(strideBegin:strideEnd)
|
outcome: "More than one device-resident object in assignment"
Apparently device-to-device copying is still not supported. One workaround that comes to mind would be a CUDA C helper function just for the device-to-device copy, but it's a bit of a hassle (more build steps and/or dependencies) I'd like to avoid if there is a better solution.
Does anyone have a hint on how I can achieve a device array slice without copying to the host and back? Thanks a lot in advance. |
|
| Back to top |
|
 |
brentl
Joined: 20 Jul 2004 Posts: 107
|
Posted: Mon Mar 04, 2013 3:57 pm Post subject: |
|
|
When you use array slice notation as a parameter, the compiler may try to create a temporary array for the slice, and that might be causing you problems.
One solution is to pass the starting address and a length. This is like the old F77 style.
call my_module_kernel_wrapper(myInput(strideBegin), strideEnd-strideEnd+1, myOutput(strideBegin))
declare the dummy arguments in module_kernel_wrapper with * rather than : in the array dimension.
You're right, we still have some problems supporting all types of device-to-device transfers using array syntax. We're working to address that in 1H of 2013. For now, you can use the cuda API, so this might work:
istat = cudaMemcpy(myTemp, myInput(strideBegin), strideEnd-strideBegin+1, cudaMemcpyDeviceToDevice)
In effect, we've written the C wrappers for you. Just "use cudafor" in the program unit where you do this and the API is available. Note that usually
the count is in units of the datatype, not bytes like it is in C.
Finally, if the striding is sort of complicated, you can code it yourself. CUF kernels might be good for this:
!$cuf kernel do <<< *, * >>>
do i = 1, strideEnd-strideBegin+1
myTemp(i) = myInput(strideBegin+i-1)
end do
This gives you some control over the number of blocks and threads that take part in the copy. |
|
| Back to top |
|
 |
michel.mueller
Joined: 30 Jan 2013 Posts: 2
|
Posted: Mon Mar 04, 2013 7:59 pm Post subject: |
|
|
| Brent, thank you a ton - your input was very very helpful. At the end I went with the Fortran 77 notation since I'm already passing in the length anyway. It works, even with 2D and 3D arrays (which I didn't mention above to keep things simple). I might also mention that this implementation is actually part of a framework where we have a hybrid codebase that works for both CPU and GPU - thanks to your help I was able to keep the GPU-only code in the wrapper to a minimum. |
|
| Back to top |
|
 |
|
|
You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum You cannot vote in polls in this forum
|
Powered by phpBB © 2001, 2002 phpBB Group
|