PGI User Forum
 SearchSearch   MemberlistMemberlist     RegisterRegister   ProfileProfile    Log inLog in 

CUDA-x86.

call thrust::min_element function from cuda fortran
Goto page 1, 2  Next
 
Post new topic   Reply to topic    PGI User Forum Forum Index -> Programming and Compiling
View previous topic :: View next topic  
Author Message
brute11k



Joined: 22 Apr 2011
Posts: 9

PostPosted: Tue Apr 10, 2012 3:01 pm    Post subject: call thrust::min_element function from cuda fortran Reply with quote

Hello there!

I'm trying to figure out how should I describe thrust function called
"min_element" in interface section:

cmin.cu:

Code:
#include <thrust>
#include <thrust>
#include <thrust>

extern "C" {
//Min for float arrays
void min_float_wrapper( float *data, int N)
{

thrust::device_ptr <float> dev_ptr(data);
thrust::min_element(dev_ptr, dev_ptr+N);
}
}


kernel.cuf:

Code:

interface thrustmin

   subroutine min_float(input,N) bind(C,name="min_float_wrapper")
   use iso_c_binding
   real(c_float),device:: input(*)
   integer(c_int),value:: N
   end subroutine

end interface


But of course this is incorrect, because thrust::min_element should return a pointer to the element, like this:

Code:
#include <thrust>
...
int data[6] = {1, 0, 2, 2, 1, 3};
int *result = thrust::min_element(data, data + 6);
// result is data + 1
// *result is 0


I'm using http://cudamusing.blogspot.com/2011/06/calling-thrust-from-cuda-fortran.html as a reference, but there is no example on how I should convert C++ pointer to Fortran pointer. I've read PGI Fortran Reference manual, page 371: Interoperability with C.
It says:

Quote:
c_f_pointer: A subroutine that assigns the C pointer target to the Fortran pointer, and specifies its shape.
Syntax
c_f_pointer (cptr, fptr [,shape] )


But I'm still confused on syntax in interface section. Could you demonstrate simple example on how I can make it work?

My current sources are here:
http://paste.pocoo.org/show/579214/ (cmin.cu)
http://pastebin.com/PwWAYDWs (kernel.cuf)

I've tried to make some calculations on pointers (in wrap function) to return only index of minimum element (not C pointer), but I can't find good explanation of class hierarchy (or just how can I convert dev_ptr to integer) in thrust library.

It should compile ok on Windows with:
nvcc -m32 -c -arch sm_13 cmin.cu --cl-version 2008 -o a.obj (Microsoft Visual Studio 2010 Command Prompt)
and then:
pgf90 -V -m32 kernel.cuf a.obj (PGI WorkStation 12.3)

But when I launch it, then error shows: http://i.imgur.com/NZoGF.png

It seems error shows after this line (in cmin.cu):
thrust::min_element(dev_ptr, dev_ptr+5);

Thank you for your hard work!
Back to top
View user's profile
brute11k



Joined: 22 Apr 2011
Posts: 9

PostPosted: Wed Apr 11, 2012 1:02 pm    Post subject: Reply with quote

Okay, somehow I've managed to make it work.
http://pastebin.com/1zwzLREr - Fortran code
http://pastebin.com/XpRGTfhM - C++ wrap functions code

Though, not all seems right. When I compile cmin.cu

Code:
nvcc -m32 -c -arch sm_10 cmin.cu --cl-version 2008 -o a.obj


It works as expected (this is output: http://pastebin.com/rHEBArZH). Then, if I want to run program on double, I should rewrite this as follows:

Code:
nvcc -m32 -c -arch sm_13 cmin.cu --cl-version 2008 -o a.obj


Then I've got this problem after executing: http://i.imgur.com/NZoGF.png

I'm compiling kernel always with this line:

Code:
pgf90 -m32 kernel.cuf a.obj -Wl,/merge:.nv_ftab=.data


Is this expected behaviour for my system? I have NVIDIA Geforce 9600GT and Windows XP SP3. If you have graphic card with computing capability of 1.3 and higher I would like to hear if this works or not, because I can't test this right now.

PGI Workstation version is 12.3
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 5952
Location: The Portland Group Inc.

PostPosted: Wed Apr 11, 2012 2:14 pm    Post subject: Reply with quote

Hi brute11k,

I'm glad you were able to get things working.

Quote:
Is this expected behaviour for my system? I have NVIDIA Geforce 9600GT and Windows XP SP3.
According to all knowing Wikipedia, the Geforce 9600GT is a CC1.0 card (See: http://en.wikipedia.org/wiki/CUDA#Supported_GPUs) so is not expected to work with double precision.
Quote:
If you have graphic card with computing capability of 1.3 and higher I would like to hear if this works or not, because I can't test this right now.
I haven't gotten a chance to install thrust on my system yet, but will give your program a try once I do.

- Mat
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 5952
Location: The Portland Group Inc.

PostPosted: Thu Apr 12, 2012 2:31 pm    Post subject: Reply with quote

Hi brute11k,

I tested your code using a C2050 on 64-bit Linux using CUDA 4.1 and PGI 12.3. Seems to work fine. My only suggestion would be to adjust the return value by one. Right now it's returning the zero-based index (C/C++) but should be one-based (Fortran).

Code:
% nvcc -m64 -c -arch sm_20 cmin.cu -o cmin.o
% pgfortran -V12.3 -Mcuda=4.1 kernel.cuf cmin.o
kernel.cuf:
% a.out
 host_array:
   1.8801760E-02   0.6125335       0.5604655       0.7881591     
   0.6007245       0.7127076       0.5583790       0.8871369     
   0.6943535       0.3582030       0.2493081       0.6643550     
   0.3236073       0.7110850       0.1861875       0.8814021     
   0.2928087       0.3424091       0.3236184       0.7466530   
 COPY!
 START CUDA! sort:
 START CUDA! min:
 DONE  CUDA!
 Min:            0
% a.out
 host_array:
   0.8871369       0.6943535       0.3582030       0.2493081     
   0.6643550       0.3236073       0.7110850       0.1861875     
   0.8814021       0.2928087       0.3424091       0.3236184     
   0.7466530       0.6695611       0.8935332       5.5116676E-02
   0.8819974       0.6337899       0.3639146       0.2517362   
 COPY!
 START CUDA! sort:
 START CUDA! min:
 DONE  CUDA!
 Min:           15


- Mat
Back to top
View user's profile
brute11k



Joined: 22 Apr 2011
Posts: 9

PostPosted: Fri Apr 13, 2012 8:18 am    Post subject: Reply with quote

mkcolg wrote:
Hi brute11k,

I tested your code using a C2050 on 64-bit Linux using CUDA 4.1 and PGI 12.3. Seems to work fine. My only suggestion would be to adjust the return value by one. Right now it's returning the zero-based index (C/C++) but should be one-based (Fortran).

- Mat


Thank you very much, Mat! You've really helped me so much since then.
Well, about zero-based indexes, I'm using array unrolling to avoid designing functions for various dimensions (because I'm testing cases for 1D, 2D and will soon for 3D).

For example:

Code:
real :: 3DArray(n,m,k), 1DArray(n*m*k)
3DArray(i, j, l)=1DArray(i*m*k + j*k + l)


So, function looks like that:

Code:
attributes(global) subroutine cuda_calculate_fluidhalf(in, innext, outhalf)
   type(Fluids), device :: in, outhalf, innext
    integer :: idx

   !calculate q(n+1/2):

    idx = M * K * (threadIdx%z + (blockIdx%z - 1) * blockDim%z - 1) + K * (threadIdx%y + (blockIdx%y - 1) * blockDim%y - 1) + (threadIdx%x + (blockIdx%x - 1) * blockDim%x - 1) + 1
    outhalf%density(idx) = ( in%density(idx) + innext%density(idx) ) / 2.0
    outhalf%energy(idx)  = ( in%energy(idx) + innext%energy(idx) ) / 2.0
    outhalf%r(idx,:)     = ( in%r(idx,:) + innext%r(idx,:) ) / 2.0
    outhalf%U(idx,:)     = ( in%U(idx,:) + innext%U(idx,:) ) / 2.0

end subroutine cuda_calculate_fluidhalf


Fluids is a structure of arrays:

Code:
integer, parameter :: dimen = 1
integer, parameter :: grid  = 384
type Fluids
   real(kind=prk) :: density(grid**dimen), energy(grid**dimen)
   real(kind=prk) :: U(grid**dimen,dimen) !U_x(:), U_y(:), U_z(:) ! speed components
   real(kind=prk) :: r(grid**dimen,dimen) !x(:), y(:), z(:)
end type


Well, if I try to rewrite my currently "universal" code to something like that:

Code:
   type Fluids3D
      real(kind=prk) :: density(N, M, K), energy(N, M, K)
      real(kind=prk) :: U(N, M, K, dimen)
      real(kind=prk) :: r(N, M, K, dimen)
   end type

attributes(global) subroutine cuda_calculate_fluidhalf(in, innext, outhalf)

   type(Fluids), device :: in, outhalf, innext
    integer :: i, j, k

   !calculate q(n+1/2):
    i   = threadIdx%z + (blockIdx%z - 1) * blockDim%z - 1
    j   = threadIdx%y + (blockIdx%y - 1) * blockDim%y - 1
    k   = threadIdx%x + (blockIdx%x - 1) * blockDim%x - 1

    outhalf%density(i, j, k) = ( in%density(i, j, k) + innext%density(i, j, k) ) / 2.0
    outhalf%energy(i, j, k)  = ( in%energy(i, j, k) + innext%energy(i, j, k) ) / 2.0
    outhalf%r(i, j, k,:)     = ( in%r(i, j, k,:) + innext%r(i, j, k,:) ) / 2.0
    outhalf%U(i, j, k,:)     = ( in%U(i, j, k,:) + innext%U(i, j, k,:) ) / 2.0

end subroutine cuda_calculate_fluidhalf


You'll see that this is sort of a ugly code, because every time we need to access variables "i, j, k" to get the needed value. "idx" here looks more compact, faster (well, not really that fast, but I can give you sources so you can test it by yourself: http://pastebin.com/JY9nUpHJ) and saving memory. Moreover, it allows me to switch between test cases a lot faster

Thank you for your time!
Back to top
View user's profile
Display posts from previous:   
Post new topic   Reply to topic    PGI User Forum Forum Index -> Programming and Compiling All times are GMT - 7 Hours
Goto page 1, 2  Next
Page 1 of 2

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © phpBB Group