PGI User Forum


CUDA Fortran and Fortran 77
PGI User Forum Forum Index -> Programming and Compiling
jeremyw



Joined: 08 Mar 2012
Posts: 7

Posted: Fri Mar 09, 2012 4:20 pm    Post subject: CUDA Fortran and Fortran 77

I have an old Fortran 77 code that I'm trying to migrate to a GPU. I basically want to execute this old code N times, hence the GPU is a perfect fit. I wrote what is essentially a driver program to accept arrays of variables (1xN) and send them to the GPU subroutine to execute my old Fortran 77 code with one set of values at a time. In a nutshell, the GPU subroutine is going to have maybe a few dozen lines and then the rest I'd like to paste in the Fortran 77 code. Is there any way I can compile my GPU subroutine using pgfortran and include compatibility for the legacy versions of Fortran?

Thanks in advance!
mkcolg



Joined: 30 Jun 2004
Posts: 6215
Location: The Portland Group Inc.

Posted: Fri Mar 09, 2012 5:07 pm    Post subject: Re: CUDA Fortran and Fortran 77

Hi jeremyw,

CUDA Fortran does require a few F90 features, notably explicit interfaces when calling global (kernel) routines. F90 modules and allocatable arrays make things easier, but are not required. So you should be fine provided you can partition your program accordingly.
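For reference, a minimal CUDA Fortran skeleton along those lines might look like this (all names and the kernel body are illustrative, not from your code):

```fortran
! Minimal CUDA Fortran sketch: a global kernel lives in a module,
! which provides the explicit interface; the driver stays F77-style.
module legacy_gpu
  use cudafor
contains
  attributes(global) subroutine run_case(vals, res, n)
    real(8), device :: vals(*), res(*)
    integer, value :: n
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) then
      ! ... body of the old F77 computation for case i goes here ...
      res(i) = vals(i)
    end if
  end subroutine
end module

program driver
  use cudafor
  use legacy_gpu
  integer, parameter :: n = 100000
  real(8), allocatable :: vals(:), res(:)
  real(8), device, allocatable :: vals_d(:), res_d(:)
  allocate(vals(n), res(n))
  allocate(vals_d(n), res_d(n))
  vals = 1.0d0
  vals_d = vals                                  ! host -> device copy
  call run_case<<<(n+255)/256, 256>>>(vals_d, res_d, n)
  res = res_d                                    ! device -> host copy
end program
```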

One thing you may consider is the PGI Accelerator/OpenACC directive-based approach. With directives, F90 is required only when passing device data between routines. If your GPU code is contained within a single subroutine, directives will make the port much easier as well as more portable.
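A sketch of the directive approach, with an assumed loop structure and illustrative names:

```fortran
! OpenACC sketch: the compiler generates the GPU kernel and the
! data movement from the directive. Names and body are illustrative.
subroutine run_all(vals, res, n)
  integer :: n, i
  real(8) :: vals(n), res(n)
!$acc parallel loop copyin(vals) copyout(res)
  do i = 1, n
    ! ... one instance of the old F77 computation ...
    res(i) = vals(i)
  end do
end subroutine
```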

Hope this helps,
Mat
jeremyw



Joined: 08 Mar 2012
Posts: 7

Posted: Fri Mar 09, 2012 5:44 pm    Post subject: Re: CUDA Fortran and Fortran 77

Hi Mat,

Thanks for your prompt reply. Unfortunately, the F77 code I have is one primary subroutine that does the bulk of the work, but it also calls maybe 15-20 small subroutines elsewhere in the code.

I've attempted to compile the F77 code using pgfortran and it spits out all kinds of errors. It seems like the changes in Fortran source format between versions are major enough that it would require a major rewrite to get it to work. For example, here is a logical variable declaration from the F77 file...

LOGICAL PLAST,DEBUG,ROUGH,UNFLAG,TOTAL,ITS,ELASTIC,
$ DEATH,FLAG1D,FLAG2D,RELOAD

To get this to work with PGFORTRAN I'd have to do something like,

LOGICAL PLAST,DEBUG,ROUGH,UNFLAG,TOTAL,ITS,ELASTIC, &
DEATH,FLAG1D,FLAG2D,RELOAD

I have tried compiling the old code with pgf77 and it works fine. It seems like there has to be an easier way of doing this that doesn't involve rewriting the old F77 code... Is there any way to use a compiled object file and call it from the GPU subroutine?

Thanks again for the help.
mkcolg



Joined: 30 Jun 2004
Posts: 6215
Location: The Portland Group Inc.

Posted: Fri Mar 09, 2012 5:58 pm    Post subject: Re: CUDA Fortran and Fortran 77

Hi jeremyw,

What are the errors? pgfortran should be able to compile F77 code without problems. The problem you note with the continuation line looks more like a fixed- versus free-format issue rather than a pgf77 vs pgfortran issue.

What flags do you use to compile? What is the extension of the file?

By default, files with a ".f" or ".F" suffix are compiled as fixed form, while ".f90" or ".F90" files are compiled as free form. So if you had renamed your fixed-form files to use an f90 extension, that would cause errors. Errors would also occur if you used the "-Mfree" flag to force free format.
Quote:
Is there any way to use a compiled object file and call it from the GPU subroutine?
If you are writing CUDA Fortran code, then you can call routines with the device attribute. However, the global kernel and the device routines it calls need to be contained within the same module.
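For example, a sketch of a device routine and the kernel that calls it, contained in one module (illustrative names):

```fortran
! Sketch: a device routine callable from a global kernel must live
! in the same module as that kernel.
module kernels
  use cudafor
contains
  attributes(device) real(8) function helper(x)
    real(8), value :: x
    helper = 2.0d0 * x
  end function

  attributes(global) subroutine kern(a, n)
    real(8), device :: a(*)
    integer, value :: n
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) a(i) = helper(a(i))   ! device call from device code
  end subroutine
end module
```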

For the directives, these routines would need to be inlined, either manually or automatically by the compiler via the "-Minline" or "-Mipa=inline" flags.

- Mat
jeremyw



Joined: 08 Mar 2012
Posts: 7

Posted: Mon Mar 12, 2012 9:11 am    Post subject: Re: CUDA Fortran and Fortran 77

Hi Mat,

Thanks for all of the suggestions. I've come a long way since my last post. My code is up and running successfully on the GPU, but now I'm tasked with the much more difficult job of optimization. It turns out the translation from F77 to F90 wasn't as bad as I initially thought: it primarily involved changing the column-1 C's and *'s to !'s for comments, changing $'s to &'s for continuation lines, and adding declarations to subroutines. All the errors I was seeing were indirectly related to one of these three things, so they went away very quickly once I made the translation.

Anyway, now that I'm up and running, I'm a little surprised at how poor the bandwidth between host and device memory is in my application.

Running bandwidthTest on my Tesla M2090, I'm seeing right around 4 GB/sec for pageable memory, which seems reasonable. In my application, I'm basically just sending some really large arrays over to device memory. One of these arrays in particular is by far the largest and dominates the transfer time. It's a 3D array of size 6x8xM of double-precision variables, where M is some arbitrary length. For the sake of testing, I set M to 100,000, so the array is 6*8*100,000 elements * 8 bytes = 38,400,000 bytes, or 0.0384 GB. At 4 GB/sec this array should take somewhere in the ballpark of 10 ms to send to the device. When I run the application I'm seeing a total time of about 40 ms, so the transfer is happening at about 1 GB/sec. Do you have any ideas why I'm seeing such poor performance? This transfer time is absolutely killing my runtime: right now it takes 40 ms to transfer to the device but only 2.5 ms to run the kernel M times.

For reference, I'm allocating this array by:

allocate(arr_dp_dev(6,8,M))

and transferring to device by:

arr_dp_dev(1:6,1:8,1:M) = arr_dp_host(1:6,1:8,1:M)
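Spelled out as a complete unit, the allocation and transfer look like the sketch below. The pinned host buffer is an assumption added here for illustration (not something tried in this post); pinned host memory usually transfers noticeably faster than pageable memory:

```fortran
! Sketch of the allocation and transfer above. The "pinned"
! attribute on the host array is an illustrative variant, not
! part of the original code.
program xfer
  use cudafor
  integer, parameter :: M = 100000
  real(8), device, allocatable :: arr_dp_dev(:,:,:)
  real(8), pinned, allocatable :: arr_dp_host(:,:,:)
  allocate(arr_dp_host(6,8,M))
  allocate(arr_dp_dev(6,8,M))
  arr_dp_host = 0.0d0        ! fill on the host
  ! whole-array assignment: one contiguous 38.4 MB transfer
  arr_dp_dev = arr_dp_host
end program
```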


On a side note, I've been compiling with the flag -Mcuda=4.1 for the toolkit I have. This morning I noticed that -Mcuda, -Mcuda=cc2.x, and -Mcuda=keepbin all three hang the compiler. For some reason the 4.1 flag is the only way it will compile and run. Any thoughts as to why this is?

Thanks again for the assistance!
Page 1 of 3

 