PGI User Forum

Problems with FORTRAN Accelerator and subroutines
 
nicolaprandi
Joined: 06 Jul 2011
Posts: 27

Posted: Wed Jul 06, 2011 1:09 am    Post subject: Problems with FORTRAN Accelerator and subroutines

Hi! I am modifying a simple 1D SWE (Shallow Water Equations) code to use PGI Accelerator 11.6 (with Visual Studio 2008) on an NVIDIA Tesla C2050.

The code has a structure like this:

Code:
CALL sub1
CALL sub2

DO i = 1,N

CALL sub3
...
CALL subM

END DO


I have used "!$acc reflected" and "!$acc region do" in every subroutine inside the DO loop (since those subroutines consist essentially of one or more DO loops):

Code:
MODULE mod3

CONTAINS

SUBROUTINE sub3

...

!$acc reflected(arr1,...)

!$acc region do

DO i = 1,N

...

END DO

!$acc end region

END SUBROUTINE sub3

END MODULE mod3


Then I created a "!$acc data region copy" outside the main program's DO loop to avoid continuous data transfer between the host and the device:

Code:
CALL sub1
CALL sub2

!$acc data region copy(arr1,...)

DO i = 1,N

CALL sub3
...
CALL subM

END DO

!$acc end data region


The results I obtain with the accelerated code differ from those of the non-accelerated code. It looks like values in the arrays are not being updated correctly. I also tried a combination of "mirror" and "update device" (before entering the main program's DO loop), but I get an error about copying data from host to device.

I've uploaded the zipped VS 2008 project on MediaFire since it's a little bit long and it would have appeared as a mess on this post:

http://www.mediafire.com/?pobzr88jeaglzct


Thanks in advance for the help,

Nicola

nicolaprandi

Posted: Wed Jul 06, 2011 7:38 am

Update: it looks like I solved the problem with the results. I had to add an "!$acc update device" for the first two arrays (Up, DxU) in the DO loop inside the main program, and to remove the Accelerator directives inside the Delta module.
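
For anyone hitting the same issue, here is a minimal sketch of why the update is needed. It is not the actual solver: the module `step_mod`, the routine `advance`, the arrays `u` and `up` and the arithmetic are all hypothetical, and only the directive pattern follows this thread. Inside a data region the host and device copies are separate, so an array assigned on the host has to be pushed to the device before a reflected routine reads it.

```fortran
MODULE step_mod
  IMPLICIT NONE
CONTAINS
  ! Hypothetical stand-in for one of the solver routines.
  SUBROUTINE advance(n, u, up)
    INTEGER, INTENT(IN) :: n
    DOUBLE PRECISION, INTENT(IN) :: u(n)
    DOUBLE PRECISION, INTENT(INOUT) :: up(n)
    INTEGER :: i
    ! The caller already holds device copies of u and up.
    !$acc reflected(u, up)
    !$acc region
    DO i = 1, n
      up(i) = up(i) + 0.5d0*u(i)
    END DO
    !$acc end region
  END SUBROUTINE advance
END MODULE step_mod

PROGRAM update_sketch
  USE step_mod
  IMPLICIT NONE
  INTEGER, PARAMETER :: n = 8, nsteps = 4
  DOUBLE PRECISION :: u(n), up(n)
  INTEGER :: step

  u  = 1.0d0
  up = 0.0d0

  !$acc data region copyin(u) copy(up)
  DO step = 1, nsteps
    ! Host-side assignment: the device copy of up is now stale.
    up = u
    ! Push the fresh host values to the device before the kernel runs.
    !$acc update device(up)
    CALL advance(n, u, up)
  END DO
  !$acc end data region

  ! Every element should be 1.0 + 0.5 = 1.5, whatever nsteps is.
  PRINT *, up(1)
END PROGRAM update_sketch
```

A compiler that does not know these directives treats them as comments and the program simply runs on the host. With the directives active but without the update, the kernel would presumably keep accumulating into the old device values instead of restarting from `u` at each step.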

Now it's time to improve the code's poor (*sigh*) performance.

nicolaprandi

Posted: Thu Jul 07, 2011 2:03 am

I have another question: what is a good strategy for improving the performance of a DO loop with several IF statements inside? For example, a structure like this:

Code:
DO i = 1,N

...

  IF(i==1) THEN
     ...
  ELSE IF(i==N) THEN
    ...
  ELSE
    ...
    IF(clause1) THEN
      ...
    ELSE IF(clause2) THEN
      ...
    ELSE
      ...
    END IF

  END IF

...

END DO



Is there any chance of improving this, or do I need to think about a totally different approach? Sorry if the question is so basic, but this is my first time in the parallel programming world.

mkcolg
Joined: 30 Jun 2004
Posts: 6129
Location: The Portland Group Inc.

Posted: Thu Jul 07, 2011 9:26 am

Hi Nicola,

I'm assuming you mean that the do loop is within an accelerator region and the if statements are part of your device kernel.

While branching is allowed, it should be avoided since it can degrade GPU performance due to thread divergence. Threads within the same warp execute the same instructions at the same time, just on different data (i.e. SIMD). Hence, if different threads within the same warp take different branches, they need to take turns issuing their instructions, causing slowdowns. (Note: a good primer on the NVIDIA CUDA threading model can be found at http://www.pgroup.com/lit/articles/insider/v2n1a5.htm.)

So for your code, the outer "i==1" and "i==N" tests aren't a big problem. Yes, you'll see a slight slowdown, but the divergence only occurs in two warps, so it's not a big deal. You could hoist these cases outside the do loop and then iterate from 2 to N-1, but this requires them to be executed on the host or in a serial device kernel. Probably not worth it, but you can experiment.

The interior if statement is more problematic. If you can arrange the data and schedule so that all the threads in a warp execute the same branch, then it's fine. Otherwise, you'll lose performance.
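
As a rough sketch of the hoisting idea, assuming a hypothetical three-point average as the loop body (the arrays `u` and `unew` and the stencil are made up for illustration and are not from your code), the two boundary cells can be handled on the host while the accelerated loop runs from 2 to n-1 with no "i==1" or "i==N" tests left in the kernel:

```fortran
PROGRAM hoist_sketch
  IMPLICIT NONE
  INTEGER, PARAMETER :: n = 16
  DOUBLE PRECISION :: u(n), unew(n)
  INTEGER :: i

  ! Hypothetical input data.
  DO i = 1, n
    u(i) = DBLE(i)
  END DO

  ! Interior cells only: every thread follows the same path.
  !$acc region
  DO i = 2, n-1
    unew(i) = 0.5d0*(u(i-1) + u(i+1))
  END DO
  !$acc end region

  ! Boundary cells, hoisted out of the loop and computed on the host.
  ! They are set after the region so that the copy back from the
  ! device cannot overwrite them.
  unew(1) = u(1)
  unew(n) = u(n)

  PRINT *, unew(1), unew(2), unew(n)
END PROGRAM hoist_sketch
```

If the arrays otherwise stay resident on the device, the host-side boundary work needs extra transfers, which is one reason this may not pay off.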

- Mat

nicolaprandi

Posted: Mon Jul 18, 2011 8:01 am

Hi Mat, I took some time to try different approaches with PGI Accelerator and with PGI CUDA FORTRAN. I ran into problems with both alternatives:

1) PGI Accelerator: I used the mirror/reflected clauses in order to call the different subroutines. These routines are called inside a do loop located in the main program. I want to speed up the code by using an accelerator region that contains that do loop, but the compiler gives me several errors due to the presence of some update clauses and the calls to the routines:

Code:
DO i = 1,N

   Up   = U
   DxU   = 0.0d0
   
   !$acc update device(Up,DxU)

   Up   = U
   DxU   = 0.0d0

   ! Compute the time step

   CALL Deltat(Dry,U,W,ULR)
   
   CALL Slopes(eta,Dry,U,DxU)

   CALL Predictor(zc,DxZ,eta,Dry,U,DxU,Up)

   CALL Reconstruction(zf,eta,Dry,U,DxU,ULR)

   CALL Fluxes(Dry,F,ULR)

   CALL SourceTerms(zc,zf,Dry,U,S0,Sf)

   CALL ComputeDU(Dry,F,S0,Sf,DtU)

   CALL Update(zf,eta,W,U,Dry,DtU)

   IF(t>tsim) i = N

END DO


Note that I changed the do-while loop into a do loop with an if statement in order to run it with the Accelerator.

2) PGI CUDA FORTRAN: in this version of the code I used a call to a global kernel subroutine (from the main program) which contains several calls to device kernel subroutines. The problem is that I get no computed values from those routines.

If you want to look at the second code, I have uploaded it on MediaFire:

http://www.mediafire.com/?ccl6fstjupr1hep



Many thanks for your help,

Nicola.