PGI User Forum
function/procedure calls not supported
Saumik



Joined: 05 Jan 2012
Posts: 10

Posted: Tue Feb 14, 2012 10:50 pm    Post subject: function/procedure calls not supported

Hi Mat,
I recently compiled a code the structure of which is as follows:
SUBROUTINE XYZ(....)
.
.
.
.
.
.
!$ACC REGION
!$ACC DO PRIVATE(N)
DO 110 N = 1,NUMEL
.
.
.
.
.
.
.
CALL ELMLIB(...........)
.
.
110 CONTINUE
!$ACC END REGION
.
.
RETURN
END

I got the following error with reference to the call to the subroutine ELMLIB inside the structured block:
----Accelerator restriction: function/procedure calls not supported----

This, I presume, is with reference to the restriction that a program may not branch in or out of an accelerator region. At the same time, I cannot do away with the call to the subroutine inside the structured block. Could you suggest a workaround?

Thanks
Saumik.
mkcolg



Joined: 30 Jun 2004
Posts: 6120
Location: The Portland Group Inc.

Posted: Wed Feb 15, 2012 8:34 am

Hi Saumik,

All subroutines need to be inlined before they can be used within an accelerator region. For details about automatic inlining by the compiler, please refer to Chapter 4 of the PGI User's Guide (https://www.pgroup.com/doc/pgiug.pdf). Basically, if the callee is located in the same file as the caller, you only need to add the flag "-Minline". Otherwise, you need to either create an extract library (-Mextract) or use IPA inlining (-Mipa=inline). However, not every routine can be automatically inlined, so on occasion you may need to manually inline the routine.
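
As a rough illustration of what manual inlining looks like, here is a minimal sketch. The function name scale2 and the program are hypothetical and not taken from the code in this thread; the idea is only that the body of the callee is written directly into the loop so that no call remains inside the accelerator region. The host-side loop afterwards still calls the original function to check the result.

```fortran
! Hypothetical sketch: scale2 and manual_inline are illustrative names,
! not part of the original poster's code.
program manual_inline
  implicit none
  integer :: i
  real :: a(100), b(100)

  do i = 1, 100
    a(i) = real(i)
  enddo

!$acc region
  do i = 1, 100
    ! Originally: b(i) = scale2(a(i))
    ! Manually inlined: the body of scale2 replaces the call.
    b(i) = 2.0 * a(i)
  enddo
!$acc end region

  ! Host-side check against the original function (outside the region).
  do i = 1, 100
    if (b(i) /= scale2(a(i))) stop 1
  enddo
  print *, 'ok'

contains

  real function scale2(x)
    real, intent(in) :: x
    scale2 = 2.0 * x
  end function scale2

end program manual_inline
```

Without an accelerator target the directives are treated as comments, so the sketch can also be built and checked on the host alone.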

Hope this helps,
Mat
Saumik



Joined: 05 Jan 2012
Posts: 10

Posted: Thu Feb 16, 2012 7:14 am

Hi Mat,
Could you give an example of a code where a loop is being parallelized and there are a couple of function calls within the loop and the corresponding (function inlining) compiler flag(s) to be appended? This would help me get started.

Thanks
Saumik.
mkcolg



Joined: 30 Jun 2004
Posts: 6120
Location: The Portland Group Inc.

Posted: Thu Feb 16, 2012 11:15 am

Hi Saumik,

Here's a very trivial example, but it should show you how to use the inlining flags.

Case 1: When the callee and caller are in the same file.
Code:

% cat test.f90
        function dosomething (x,y)
          real :: x,y
          real :: dosomething
          dosomething = x*y
        end function

        program foo
          real a(100), b(100)
!$acc region
          do i = 1,100
            a(i) = float(i) * 100
          enddo

          do i = 1,100
            b(i) = dosomething(a(i),a(i))
          enddo
!$acc end region
          print *, b(99), b(1)
        end
% pgf90 -ta=nvidia -Minfo=accel,inline test.f90
PGF90-W-0155-Accelerator region ignored; see -Minfo messages  (test.f90: 9)
foo:
      9, Accelerator region ignored
     14, Accelerator restriction: function/procedure calls are not supported
     15, Accelerator restriction: unsupported call to 'dosomething'
  0 inform,   1 warnings,   0 severes, 0 fatal for foo
% pgf90 -ta=nvidia -Minfo=accel,inline -Minline test.f90
foo:
      9, Generating copyout(b(:))
         Generating copyout(a(:))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     10, Loop is parallelizable
         Accelerator kernel generated
         10, !$acc do parallel, vector(96) ! blockidx%x threadidx%x
             CC 1.0 : 3 registers; 32 shared, 8 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 8 registers; 8 shared, 40 constant, 0 local memory bytes; 50% occupancy
     14, Loop is parallelizable
         Accelerator kernel generated
         14, !$acc do parallel, vector(96) ! blockidx%x threadidx%x
             CC 1.0 : 3 registers; 40 shared, 8 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 10 registers; 8 shared, 48 constant, 0 local memory bytes; 50% occupancy
     15, dosomething inlined, size=2, file test.f90 (1)


Case 2: The callee and caller are in separate files. Both files are compiled on the same command line.
Code:
% cat do_mod.f90
    module do_mod
     contains
        function dosomething (x,y)
          real :: x,y
          real :: dosomething
          dosomething = x*y
        end function
    end module do_mod

% cat test1.f90

        program foo
        use do_mod
          real a(100), b(100)
!$acc region
          do i = 1,100
            a(i) = float(i) * 100
          enddo

          do i = 1,100
            b(i) = dosomething(a(i),a(i))
          enddo
!$acc end region
          print *, b(99), b(1)
        end
% pgf90 -ta=nvidia -Minfo=accel,inline -Minline do_mod.f90 test1.f90
do_mod.f90:
test1.f90:
do_mod.f90:
test1.f90:
foo:
      5, Generating copyout(b(:))
         Generating copyout(a(:))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
      6, Loop is parallelizable
         Accelerator kernel generated
          6, !$acc do parallel, vector(96) ! blockidx%x threadidx%x
             CC 1.0 : 3 registers; 32 shared, 8 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 8 registers; 8 shared, 40 constant, 0 local memory bytes; 50% occupancy
     10, Loop is parallelizable
         Accelerator kernel generated
         10, !$acc do parallel, vector(96) ! blockidx%x threadidx%x
             CC 1.0 : 3 registers; 40 shared, 8 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 10 registers; 8 shared, 48 constant, 0 local memory bytes; 50% occupancy
     11, dosomething inlined, size=2, file do_mod.f90 (3)


Case 3: The callee and caller are in separate files. Separate compilation using IPA.

Code:
% pgf90 -c -Mipa=inline do_mod.f90
% pgf90 -Mipa=inline -ta=nvidia -Minfo do_mod.o test1.f90
test1.f90:
PGF90-W-0155-Accelerator region ignored; see -Minfo messages  (test1.f90: 5)
foo:
      5, Accelerator region ignored
     10, Accelerator restriction: function/procedure calls are not supported
     11, Accelerator restriction: unsupported call to 'dosomething'
  0 inform,   1 warnings,   0 severes, 0 fatal for foo
IPA: no IPA optimizations for 1 source files
IPA: Recompiling test1.o: stale object file
foo:
      5, Generating copyout(b(:))
         Generating copyout(a(:))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
      6, Loop is parallelizable
         Accelerator kernel generated
          6, !$acc do parallel, vector(96) ! blockidx%x threadidx%x
             CC 1.0 : 3 registers; 32 shared, 8 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 8 registers; 8 shared, 40 constant, 0 local memory bytes; 50% occupancy
     10, Loop is parallelizable
         Accelerator kernel generated
         10, !$acc do parallel, vector(96) ! blockidx%x threadidx%x
             CC 1.0 : 3 registers; 40 shared, 8 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 10 registers; 8 shared, 48 constant, 0 local memory bytes; 50% occupancy
     11, dosomething inlined, size=2 (IPA) file do_mod.f90 (3)


Case 4: The callee and caller are in separate files. Create an extract library.
Code:
% pgf90 -Mextract=lib:extlib do_mod.f90
% pgf90 -Mextract=lib:extlib test1.f90
% pgf90 -c do_mod.f90
% pgf90 -Minline=lib:extlib -ta=nvidia -Minfo -c test1.f90
foo:
      5, Generating copyout(b(:))
         Generating copyout(a(:))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
      6, Loop is parallelizable
         Accelerator kernel generated
          6, !$acc do parallel, vector(96) ! blockidx%x threadidx%x
             CC 1.0 : 3 registers; 32 shared, 8 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 8 registers; 8 shared, 40 constant, 0 local memory bytes; 50% occupancy
     10, Loop is parallelizable
         Accelerator kernel generated
         10, !$acc do parallel, vector(96) ! blockidx%x threadidx%x
             CC 1.0 : 3 registers; 40 shared, 8 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 10 registers; 8 shared, 48 constant, 0 local memory bytes; 50% occupancy
     11, dosomething inlined, size=2, file do_mod.f90 (3)
% pgf90 -ta=nvidia -Minfo test1.o do_mod.o


During development, I find extract libraries to be the easiest method. Yes, it requires the extra extract step, but this only needs to be done once (unless the file changes). I can then simply recompile the source file with accelerator directives without needing to go through the link step with IPA just to see if my changes work.

Hope this helps,
Mat
Saumik



Joined: 05 Jan 2012
Posts: 10

Posted: Fri Mar 02, 2012 10:15 am

Hi Mat,
The problem I am facing is that the subroutine definitions contain alternate return statements/data statements/format statements/assigned goto statements making them not inlinable. The fact remains that it is virtually impossible to tinker with these statements without destroying the existing structure of the code. Is there a workaround?

Thanks
Saumik.
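
For readers facing the same obstacle, one commonly used rewrite, where touching the routine is feasible at all, is to replace an alternate return with an integer status argument that the caller tests. The sketch below is only an illustration: check_old and check_new are hypothetical names, and whether the compiler can then inline the routine is an assumption that has to be verified with -Minfo=inline.

```fortran
! Hypothetical sketch: check_old and check_new are illustrative names,
! not routines from the original poster's code.
program altreturn_sketch
  implicit none
  integer :: ierr
  real :: y

  ! New style: the status comes back in an ordinary integer argument.
  call check_new(-1.0, y, ierr)
  if (ierr /= 0) then
    print *, 'negative input flagged, ierr =', ierr
  else
    print *, 'y =', y
  endif

contains

  ! Old style, shown for reference only (alternate return):
  !   subroutine check_old(x, y, *)
  !     if (x < 0.0) return 1
  !     y = sqrt(x)
  !   end
  ! called as: call check_old(x, y, *900)

  subroutine check_new(x, y, ierr)
    real, intent(in) :: x
    real, intent(out) :: y
    integer, intent(out) :: ierr
    ierr = 0
    y = 0.0
    if (x < 0.0) then
      ierr = 1      ! replaces "return 1"
      return
    endif
    y = sqrt(x)
  end subroutine check_new

end program altreturn_sketch
```

The caller's branch on ierr takes over the role of the statement label that the alternate return used to jump to.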
All times are GMT - 7 Hours
Page 1 of 2