PGI User Forum


Six Loops iteration and reduction
herohzy



Joined: 14 May 2010
Posts: 7

Posted: Thu Mar 15, 2012 6:44 pm

Quote:

If you don't think you are getting good performance, what is the profile information (-ta=nvidia,time) telling you? Do you not have enough parallelism (see the number of grids and blocks)? Is data movement causing the issue? Is your device initialisation time high?


I set nm=5 and ran the program, which gave this message:
Code:

call to cuMemAlloc returned error 2: Out of memory
CUDA driver version:4020

    31:region entered 1 time
         time(us): init=0


The GPU I use is a GeForce GTS 450. Does it not have enough memory for my computation?
mkcolg



Joined: 30 Jun 2004
Posts: 6134
Location: The Portland Group Inc.

Posted: Fri Mar 16, 2012 10:54 am

Quote:
The GPU I use is a GeForce GTS 450. Does it not have enough memory for my computation?
It could be your device or it could be the program; I'm not sure. I would need a reproducing example to tell what's wrong. Can you post one, or send an example to PGI Customer Service (trs@pgroup.com) and ask them to forward it to me?

- Mat
herohzy



Joined: 14 May 2010
Posts: 7

Posted: Fri Mar 16, 2012 7:19 pm

Hi Mat,
Thanks for your patient help!
Here is the complete code of my test program,
Code:

      program prog
     use accel_lib

      implicit none
      ! Variables
     integer::ik,iky,ip,ipy,iq,iqy
     integer::nm

     real,allocatable::ffq(:,:),ffqq(:,:)
     real::ffp,ffpp
     real::ffk,ffkk

     real::ffqq1,ffq1
     real::ffpc,ffppc
     real::ffkc,ffkkc

     integer::c0,c1,c2
      ! Body
     nm=20
     allocate(ffqq(nm,nm))
     !call acc_init(acc_device_nvidia)

     call system_clock(count=c0)

     ffkk=0.0

     !$acc region
!$acc do parallel, vector(16)
     do 30 ik=1,nm
!$acc do kernel, parallel, vector(16),private(ffqq),private(ffq)
     do 30 iky=1,nm
     
      ffqq=0.0
      do 201 ip=1,nm
      do 201 ipy=1,nm
         
         ffq=0.0
         do 10 iq=1,nm
         do 10 iqy=1,nm
            ffq(iq,iqy)=1.0/nm/nm
            !ffqq1=ffqq1+ffq
            !ffqq(ip,ipy)=ffqq1/2.0
10       continue
      ffqq(ip,ipy)=sum(ffq)


201      continue

      ffpp=0.0
      do 20 ip=1,nm
      do 20 ipy=1,nm
         ffp=ffqq(ip,ipy)/nm/nm
         ffpp=ffpp+ffp
20      continue

      ffk=ffpp/nm/nm
      ffkk=ffkk+ffk
30     continue
     !$acc end region

     call system_clock(count=c1)
     ffkk=ffkk
     write(*,*)'ffkk',ffkk
   
   !Check on the CPU   
     ffkkc=0.0
     do ik=1,nm*nm

      ffpp=0.0
      do ip=1,nm*nm
         
         ffqq1=0.0
         do iq=1,nm*nm
            ffq1=1.0/nm/nm
            ffqq1=ffqq1+ffq1
         enddo

         ffpc=ffqq1/nm/nm
         ffppc=ffppc+ffpc
      enddo

      ffkc=ffppc/nm/nm
      ffkkc=ffkkc+ffkc
     enddo
     call system_clock(count=c2)

     write(*,*)'ffkkc',ffkkc

     write(*,*)c1-c0,'ms on GPU'
     write(*,*)c2-c1,'ms on host'
      end program prog

I have tried many times, but the same error from my last post still appears.
herohzy



Joined: 14 May 2010
Posts: 7

Posted: Mon Mar 19, 2012 9:10 am

Hi Mat,
I have sent the source code of the program that I really want to accelerate to PGI Customer Service.
I would really appreciate it if you could spare the time to check it!

Thanks,
herohzy.
mkcolg



Joined: 30 Jun 2004
Posts: 6134
Location: The Portland Group Inc.

Posted: Mon Mar 19, 2012 9:23 am

Hi herohzy,

The problem with the above code is that you forgot to allocate ffq. Hence, when the compiler goes to allocate ffq on the device, it's using a bogus size. When using accelerator directives, be sure your program is correct on the host first; otherwise you may waste a lot of time chasing down what appear to be odd GPU issues but are really basic issues with your code.
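
One way to catch this kind of mistake on the host before the accelerator region runs is to test each allocatable array with the standard `allocated` intrinsic. The following is only a minimal sketch under that assumption, not part of the original test program: the program name `check_alloc` and the error message are illustrative, and the array names and `nm=20` are taken from the test program above.

```fortran
program check_alloc
  implicit none
  integer, parameter :: nm = 20            ! same size as in the test program
  real, allocatable :: ffq(:,:), ffqq(:,:)

  allocate(ffqq(nm,nm))
  allocate(ffq(nm,nm))                     ! the allocation missing from the original

  ! Guard (illustrative): stop on the host if either array was never
  ! allocated, instead of letting a bogus size reach the device.
  if (.not. allocated(ffq) .or. .not. allocated(ffqq)) then
     write(*,*) 'error: ffq and ffqq must be allocated before the region'
     stop 1
  end if

  ! Same per-element value as the innermost loops of the test program
  ffq = 1.0/nm/nm
  ffqq = sum(ffq)
  write(*,*) 'size(ffq) =', size(ffq), ' sum(ffq) =', sum(ffq)
end program check_alloc
```

If the `allocate(ffq(nm,nm))` line is removed, the guard stops the program with a readable message on the host rather than a cuMemAlloc failure on the device.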

I'll get your full program from customer support to see if it's the same problem there.

- Mat

Code:
% diff test_org.f90 test.f90
22c22
<
---
>      allocate(ffq(nm,nm))
% pgf90 test.f90 -ta=nvidia -Minfo=accel
prog:
     29, Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     31, Loop is parallelizable
     33, Loop is parallelizable
         Accelerator kernel generated
         31, !$acc do parallel, vector(16) ! blockidx%y threadidx%y
         33, !$acc do parallel, vector(16) ! blockidx%x threadidx%x
             CC 1.0 : 28 registers; 1112 shared, 44 constant, 0 local memory bytes; 33% occupancy
             CC 2.0 : 59 registers; 1040 shared, 100 constant, 0 local memory bytes; 33% occupancy
         59, Sum reduction generated for ffkk
     35, Loop is parallelizable
     36, Loop carried reuse of 'ffq' prevents parallelization
     37, Loop carried reuse of 'ffq' prevents parallelization
     39, Loop is parallelizable
     40, Loop is parallelizable
     41, Loop is parallelizable
     46, Loop is parallelizable
     52, Loop is parallelizable
     53, Loop is parallelizable
% a.out
 ffkk   0.9999990
 ffkkc    200.6242
      5475039 ms on GPU
        63251 ms on host
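
Note that the host check value (ffkkc = 200.6242) does not agree with the GPU result (ffkk = 0.9999990). In the test program as posted, the host loop resets `ffpp` but accumulates into `ffppc`, which is never initialized. If `ffppc` happens to start at zero, it grows by about 1 on every `ik` iteration, and the final sum comes to roughly 200.5, which is consistent with the output above. The sketch below is a hedged rewrite of the host check only, resetting `ffppc` instead. It also queries `count_rate` from `system_clock`, on the assumption that the tick unit is processor dependent and not necessarily milliseconds. The program name `host_check` is illustrative.

```fortran
program host_check
  implicit none
  integer, parameter :: nm = 20            ! same size as in the test program
  integer :: ik, ip, iq
  integer :: c0, c1, rate
  real :: ffq1, ffqq1, ffpc, ffppc, ffkc, ffkkc

  ! count_rate gives the number of clock ticks per second
  call system_clock(count=c0, count_rate=rate)

  ffkkc = 0.0
  do ik = 1, nm*nm
     ffppc = 0.0                           ! reset the accumulator that is actually summed
     do ip = 1, nm*nm
        ffqq1 = 0.0
        do iq = 1, nm*nm
           ffq1 = 1.0/nm/nm
           ffqq1 = ffqq1 + ffq1
        end do
        ffpc = ffqq1/nm/nm
        ffppc = ffppc + ffpc
     end do
     ffkc = ffppc/nm/nm
     ffkkc = ffkkc + ffkc
  end do

  call system_clock(count=c1)

  write(*,*) 'ffkkc', ffkkc
  ! Convert ticks to seconds using the reported rate
  write(*,*) real(c1 - c0)/real(rate), 's on host'
end program host_check
```

With the accumulator reset, the host value should come out close to 1.0, matching the GPU result up to single-precision rounding.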
All times are GMT - 7 Hours
Page 2 of 4