PGI User Forum

declarative data error in PGI Fortran 10
Goto page Previous  1, 2, 3  Next
 
Forum Index -> Accelerator Programming
mkcolg

Joined: 30 Jun 2004
Posts: 6208
Location: The Portland Group Inc.

Posted: Tue Nov 30, 2010 2:41 pm

Hi sindimo,

Implicit data regions will be available in the upcoming 2011 (11.0) release.

Note that 'declclause' in the documentation is a placeholder for one of the data clauses (copy, copyin, etc.), not a clause itself. Hence, change:

Code:
!$acc declclause copyin(A) copyin(B)
to
Code:
!$acc copyin(A) copyin(B)


However, since the copy to the device happens at the line where you added the directive, you're actually copying junk values to the device. Instead, you need A and B to be copied after they are initialized. The simplest way to do this is to add 'copyin(A,B)' to the explicit data region (this will work in 10.9 as well). To use an implicit data region, you would need to declare A and B as local and then use the updatein clause to copy the data to the device. (See the example below.)

Finally, since the total size of your arrays is larger than 2GB, you will need to compile with the "-mcmodel=medium" flag.

Hope this helps,
Mat

Code:
% cat decl.f90

       program main
         call MM()
         call MM()
         end program main


         subroutine MM ()
         integer dim1, dim2, dim3
         parameter (dim1 = 10000, dim2 = 10000, dim3 = 10000)
         double precision A(dim1, dim2), B(dim2, dim3), C(dim1, dim3)
         real start, finish
!$acc local(A,B)

      call srand(86456)
      do i = 1, dim1
        do j = 1, dim2
          A(i, j) = rand()
        enddo
      enddo
      do i = 1, dim2
        do j = 1, dim3
          B(i, j) = rand()
        enddo
      enddo

!$acc updatein(A,B)

      call cpu_time(start)

!$acc data region copyout(C)
! in 10.9 and earlier use
!  acc data region copyin(A,B), copyout(C)

!$acc region
        do j = 1, dim3
        do i = 1, dim1
          C(i, j) = 0
        enddo
        do k = 1, dim2
          do i = 1, dim1
            C(i, j) = C(i, j) + A(i, k)*B(k, j)
          enddo
        enddo
       enddo
!$acc end region

!$acc end data region

     call cpu_time(finish)
     print *,'time for C(',dim1,',',dim3,') = A(',dim1,',',dim2,') B(', &
      dim2,',',dim3,') is',finish - start,' s'

      end subroutine MM

% pgf90 -ta=nvidia -Minfo=accel decl.f90 -V11.0 -mcmodel=medium
mm:
     13, Generating local(b(:,:))
         Generating local(a(:,:))
     27, Generating !$acc update device(b(:,:))
         Generating !$acc update device(a(:,:))
     31, Generating copyout(c(:,:))
     33, Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     34, Loop is parallelizable
     35, Loop is parallelizable
         Accelerator kernel generated
         34, !$acc do parallel, vector(16) ! blockidx%y threadidx%y
         35, !$acc do parallel, vector(16) ! blockidx%x threadidx%x
             CC 1.3 : 8 registers; 32 shared, 8 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 10 registers; 8 shared, 40 constant, 0 local memory bytes; 100% occupancy
     38, Loop carried reuse of 'c' prevents parallelization
     39, Loop is parallelizable
         Accelerator kernel generated
         34, !$acc do parallel, vector(16) ! blockidx%y threadidx%y
         38, !$acc do seq(16)
             Cached references to size [16x16] block of 'a'
             Cached references to size [16x16] block of 'b'
         39, !$acc do parallel, vector(16) ! blockidx%x threadidx%x
             Using register for 'c'
             CC 1.3 : 26 registers; 4400 shared, 32 constant, 0 local memory bytes; 50% occupancy
             CC 2.0 : 40 registers; 4360 shared, 60 constant, 0 local memory bytes; 50% occupancy
% a.out
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000
 ) B(        10000 ,        10000 ) is    47.59940      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000
 ) B(        10000 ,        10000 ) is    47.10842      s
sindimo

Joined: 30 Nov 2010
Posts: 29
Location: Saudi Aramco

Posted: Wed Dec 01, 2010 12:43 am

Dear Mat, thanks for your quick response and clarifications.

One more side question, please: how were you able to determine that the total size of the arrays is larger than 2GB? Can the PGI compiler point that out during compilation?

This would be helpful, since what we are trying to accomplish here is to migrate one of our programs to run on GPUs. Our goal in experimenting with the data regions is to minimize data movement between the CPU and GPU as much as possible, since we are seeing a bottleneck in the PCIe communication between the two (data movement takes much more time than the actual kernel processing, which causes an overhead).

The code I posted earlier is just a simple matrix multiplication that mimics a portion of the actual program we are trying to port to GPUs using the PGI directives; the array sizes in the actual program might be even larger than in this one.

If you have any suggestions or hints from your experience on reducing data movement, please feel free to share them.

Many thanks for your help.


Last edited by sindimo on Tue Dec 07, 2010 12:32 pm; edited 1 time in total
mkcolg

Joined: 30 Jun 2004
Posts: 6208
Location: The Portland Group Inc.

Posted: Wed Dec 01, 2010 9:39 am

Hi Mohamad Sindi,

Quote:
One more side question please, how were you able to determine that the total size of the arrays are larger than 2GB, can the PGI compiler point that out during compilation?
The first time I tried to compile your program I got a "relocation truncated to fit: R_X86_64_PC32 against symbol ..." error. That typically means that your static data size is greater than 2GB. I then looked at the code and found you have three 10000x10000 double precision arrays, totaling about 2.2GB (3 x 10000 x 10000 x 8 bytes).

Quote:

If you have any suggestions or hints from your experience on reducing data movement, please feel free sharing it.

The "reflected" and "mirror" directives will be available in the 11.0 release. Reflected will allow you to have data regions that span subroutine calls, thus allowing you to keep more data on the device for larger portions of your code.

Mirror is applied to allocatable arrays in module data. It mirrors the allocation status of the array on the host and the GPU. The caveat is that you will need to manage the data movement yourself using the update directives, and be careful to understand which copy of the array, host or device, you're working on.
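A minimal sketch of how a mirrored module allocatable might look, assuming the 11.0 syntax follows the style of the examples in this thread (the module name "workdata", the array shape, and the updatein spelling here are illustrative, not taken from the release notes):

```fortran
module workdata
  double precision, allocatable :: A(:,:)
!$acc mirror(A)
end module workdata

subroutine fill()
  use workdata
  allocate(A(1000,1000))   ! allocates the host copy and its device mirror
  A = 1.0d0                ! initializes the host copy only
!$acc updatein(A)          ! then copy the host values to the device copy
end subroutine fill
```

The key point is that allocation status is kept in sync automatically, but the values are not; each updatein/updateout is an explicit choice about which copy is current.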

Other things:
- Have a basic understanding of the GPU architecture. This article is a good place to start: http://www.pgroup.com/lit/articles/insider/v2n1a5.htm
- Get good at understanding the -Minfo=accel messages that the compiler prints during compilation. They hold valuable clues on ways to improve performance.
- Specifically, look for "non-stride-1" messages. These mean that the GPU threads are not accessing global data sequentially. Data movement between the device's global and local memory can be even more important to performance than optimizing the host/GPU data movement.
- Compile with basic profiling enabled during development, i.e. "-ta=nvidia,time". This will highlight the performance bottlenecks and show the actual cost of moving data to the GPU.
- Make sure your algorithm is data parallel and can utilize tens of thousands of threads. Without this you may still be able to run your code on a GPU, but the performance will most likely be poor.
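To illustrate the non-stride-1 point: Fortran stores arrays column-major, so the innermost (vector) loop should run over the first subscript. A small host-side sketch (illustrative only, not from the posted program):

```fortran
program stride_demo
  integer, parameter :: n = 1000
  double precision :: A(n, n)
  integer :: i, j

  ! Non-stride-1: the inner loop varies the *second* index, so
  ! consecutive iterations touch elements n entries apart in memory.
  do i = 1, n
    do j = 1, n
      A(i, j) = 0.0d0
    enddo
  enddo

  ! Stride-1: varying the *first* index in the inner loop walks down
  ! a column, which is contiguous in memory and coalesces on the GPU.
  do j = 1, n
    do i = 1, n
      A(i, j) = 0.0d0
    enddo
  enddo
end program stride_demo
```

This is why the matrix-multiply kernels in this thread parallelize the j and i loops and leave k sequential.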

Please feel free to ask specific questions as you encounter them. While the PGI Accelerator model does make GPU programming easier, GPU programming is still hard (at least at first).

- Mat


Last edited by mkcolg on Wed Dec 08, 2010 1:31 pm; edited 2 times in total
sindimo

Joined: 30 Nov 2010
Posts: 29
Location: Saudi Aramco

Posted: Sun Dec 05, 2010 6:20 am

Thanks Mat for your valuable feedback!
sindimo

Joined: 30 Nov 2010
Posts: 29
Location: Saudi Aramco

Posted: Mon Dec 27, 2010 3:13 am

Dear Mat, I hope you're doing well and happy holidays.

It seems that PGI 11 was released on Dec 22, so I have it installed now and I am very eager to test the new "reflected" feature.

As you mentioned previously, reflected should allow us to have data regions that span subroutine calls, thus letting us keep more data on the device for a larger portion of the code. This will also help avoid having to do manual inlining.

I have the program below, in which I try to multiply matrices A and B several times while only loading A and B once into the GPU's memory. The ultimate goal is to reduce data copying between the CPU and GPU using "reflected".

I am just trying to figure out how to do it in a simple program before implementing it in my actual application.

I tried following an example you posted earlier:
http://www.pgroup.com/userforum/viewtopic.php?t=2202&postdays=0&postorder=asc&start=10

However, I get the below error regarding "EventSynchronize" when I run my program:
Code:

[sindimo@slcb100 working-fortran-example-with-gpu]$ /usr/local/pgi11/linux86-64/11.0/bin/pgfortran -fast -ta=nvidia,time -Minfo=all,accel -mcmodel=medium -Minline reflected.f
main:
     12, Loop not vectorized/parallelized: contains call
     17, Loop not vectorized/parallelized: contains call
     23, Loop not vectorized/parallelized: contains call
     24, Generating copyout(c(:,:))
         Generating copyin(b(:,:))
         Generating copyin(a(:,:))
mm:
     40, Generating reflected(b(:,:))
         Generating reflected(a(:,:))
     45, Generating copyout(c(1:10000,1:10000))
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     46, Loop is parallelizable
     47, Loop is parallelizable
         Accelerator kernel generated
         46, !$acc do parallel, vector(16) ! blockidx%y threadidx%y
         47, !$acc do parallel, vector(16) ! blockidx%x threadidx%x
             CC 1.3 : 8 registers; 32 shared, 8 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 10 registers; 8 shared, 40 constant, 0 local memory bytes; 100% occupancy
     50, Loop carried reuse of 'c' prevents parallelization
     51, Loop is parallelizable
         Accelerator kernel generated
         46, !$acc do parallel, vector(16) ! blockidx%y threadidx%y
         50, !$acc do seq(16)
             Cached references to size [16x16] block of 'a'
             Cached references to size [16x16] block of 'b'
         51, !$acc do parallel, vector(16) ! blockidx%x threadidx%x
             Using register for 'c'
             CC 1.3 : 25 registers; 4400 shared, 24 constant, 0 local memory bytes; 50% occupancy
             CC 2.0 : 35 registers; 4360 shared, 60 constant, 0 local memory bytes; 50% occupancy

[sindimo@slcb100 working-fortran-example-with-gpu]$ ./a.out
call to EventSynchronize returned error 700: Launch failed
CUDA driver version: 3010

Accelerator Kernel Timing data
reflected.f
  mm
    45: region entered 1 time
        time(us): init=1
        47: kernel launched 1 times
            grid: [625x625]  block: [16x16]
            time(us): total=16549 max=16549 min=16549 avg=16549
        51: kernel launched 1 times
            grid: [625x625]  block: [16x16]
            time(us): total=0 max=0 min=0 avg=0
reflected.f
  main
    24: region entered 1 time
        time(us): init=1361723
                  data=566929


If I comment out the data region and the reflected directive, it works fine:
Code:

[sindimo@slcb100 working-fortran-example-with-gpu]$ ./a.out
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000
 ) B(        10000 ,        10000 ) is    38.48516      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000
 ) B(        10000 ,        10000 ) is    36.61761      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000
 ) B(        10000 ,        10000 ) is    36.61651      s

Accelerator Kernel Timing data
reflected.f
  main
    25: region entered 3 times
        time(us): total=111719220 init=1517525 region=110201695
                  kernels=108132453 data=2010386
        w/o init: total=110201695 max=36967588 min=36616499 avg=36733898
        25: kernel launched 6 times
            grid: [625x625]  block: [16x16]
            time(us): total=108132453 max=36028300 min=16542 avg=18022075


This is the code; I am not sure why it's not working when using the reflected directive:
Code:

[sindimo@slcb100 working-fortran-example-with-gpu]$ cat reflected.f
         program main

         use accel_lib

         integer dim1, dim2, dim3
         parameter (dim1 = 10000, dim2 = 10000, dim3 = 10000)
         double precision A(dim1, dim2), B(dim2, dim3), C(dim1, dim3)

              !populate 2 random matrices
              call srand(86456)
                do i = 1, dim1
                do j = 1, dim2
                  A(i, j) = rand()
               enddo
               enddo
               do i = 1, dim2
               do j = 1, dim3
               B(i, j) = rand()
               enddo
               enddo

           !Trying to multiply the 2 matrices several times (only load them once into the GPU memory)
!$acc data region copyin(A,B) copyout(C)
           do i = 1, 3
             call MM(A,B,C)
           enddo
!$acc end data region

         end program main


         subroutine MM (A,B,C)
         integer dim1, dim2, dim3
         parameter (dim1 = 10000, dim2 = 10000, dim3 = 10000)
         double precision A(dim1, dim2), B(dim2, dim3), C(dim1, dim3)
         real start, finish

!$acc reflected(A,B)     

      call cpu_time(start)


!$acc region
        do j = 1, dim3
        do i = 1, dim1
          C(i, j) = 0
        enddo
        do k = 1, dim2
          do i = 1, dim1
            C(i, j) = C(i, j) + A(i, k)*B(k, j)
          enddo
        enddo
       enddo
!$acc end region


      call cpu_time(finish)

      print *,'time for C(',dim1,',',dim3,') = A(',dim1,',',dim2,') B(',
     1dim2,',',dim3,') is',finish - start,' s'
     
      end subroutine MM


Many thanks for your help, I really appreciate it.

Mohamad Sindi


Last edited by sindimo on Tue Dec 28, 2010 1:09 am; edited 1 time in total
Page 2 of 3

 