PGI User Forum


data region problem

 
WmBruce




Posted: Thu Jul 29, 2010 11:32 am    Post subject: data region problem

Hi again,

I'm taking the advice from one of my previous posts and have been playing around with the data region directive. From my understanding, !$acc data region local(...) should make all the arrays in the list local within the data region. Yet that doesn't seem to be the case in my program:

Code:

program main

!=====( Initialization )===========================!
    use accel_lib

    real, dimension(:), allocatable :: randnums, seeds
    double precision, dimension(:), allocatable :: particles
    real, parameter :: PI = 4 * atan(1.0)
    integer :: numparticles, numseeds, seedinit, count(200)
    integer, parameter :: K4B=selected_int_kind(9)
   integer(K4B), parameter :: IA=16807,IM=2147483647,IQ=127773,IR=2836
   real :: am
   integer(K4B) :: gix,giy,gk,ix,iy,k
   
    write(*,*) "How many particles?"
    read(*,*) numparticles
    write(*,*) "Initial seed?"
    read(*,*) seedinit
   
    numseeds = int(sqrt(200.0 * numparticles)) + 1
    !future note: set some limit to the number of random numbers since the array might
    !not be able to hold all of it
   
    allocate(particles(numparticles))
   allocate(seeds(numseeds))
   allocate(randnums(numseeds ** 2))
   
    call acc_init(acc_device_nvidia)

!=====( Generate Random Numbers )==================!
   !$acc data region local(particles, seeds, randnums), copyout(count)
   !$acc region
      am = nearest(1.0,-1.0)/IM
      iy=ior(ieor(888889999,abs(seedinit)),1)
      ix=ieor(777755555,abs(seedinit))
      
      !$acc do kernel
       do j = 1, numseeds
         ix=ieor(ix,ishft(ix,13))
         ix=ieor(ix,ishft(ix,-17))
         ix=ieor(ix,ishft(ix,5))
         k=iy/IQ
         iy=IA*(iy-k*IQ)-IR*k
         if (iy < 0) iy=iy+IM
         seeds(j)=am*ior(iand(IM,ieor(ix,iy)),1)
      end do
   
      !$acc do vector(256), parallel, independent
        do j = 1, numseeds
         giy=ior(ieor(888889999, int(1000 * seeds(j)) + 1), 1)
         gix=ieor(777755555, int(1000 * seeds(j)) + 1)
         
         !$acc do seq
         do jj = 1, numseeds
            gix=ieor(gix,ishft(gix,13))
            gix=ieor(gix,ishft(gix,-17))
            gix=ieor(gix,ishft(gix,5))
            gk=giy/IQ
            giy=IA*(giy-gk*IQ)-IR*gk
            if (giy < 0) giy=giy+IM
            randnums((j - 1) * numseeds + jj) = am*ior(iand(IM,ieor(gix,giy)),1)
         end do
        end do
       
!=====( Main )=================================!     
      !$acc do vector(256), parallel, independent
      do j = 1, numparticles
         particles(j) = 1
      end do
   !$acc end region
   
   do i = 1,20
      print *,randnums(i)    
   enddo
      
   do j = 1,200
      !$acc region do vector(256), parallel, independent   
         do jj = 1, numparticles
            if (randnums((j - 1) * numparticles + jj) .lt. 0.1) particles(jj) = 0 
         end do

         
         do jj = 1,numparticles
            if (particles(jj) .eq. 1) count(j) = count(j) + 1
         enddo
         
         if (count(j) .eq. 0) goto 100
      enddo
100      continue
   
   !$acc end data region

   open(unit = 2, file = 'data.txt')
   
    write(2,1000) 0, numparticles
   
    do i = 1, 200
      write(2,1000) i, count(i)
      if (count(i) .eq. 0) exit
   end do
   
   write(2,*) "c   Cycle Number   Number of particles"
   
1000   format (i5,i10)

end program


All of the prints of randnums give me 0.000000. It seems like randnums was made local only within the region directive, not within the data region directive.

I also have a few questions regarding data transfer.

Code:

     31, Generating local(randnums(:))
         Generating local(seeds(:))
         Generating local(particles(:))
         Generating copyout(count(:))
     32, Generating compute capability 1.3 binary
     38, Loop carried scalar dependence for 'ix' at line 39
         Loop carried scalar dependence for 'iy' at line 42
         Loop carried scalar dependence for 'iy' at line 43
         Inner sequential loop scheduled on host
     49, Loop is parallelizable
         Accelerator kernel generated
         49, !$acc do parallel, vector(256)
             Using register for 'seeds'
              CC 1.3 : 15 registers; 20 shared, 100 constant, 0 local memory bytes; 100 occupancy
     54, Loop carried scalar dependence for 'gix' at line 55
         Loop carried scalar dependence for 'giy' at line 58
         Loop carried scalar dependence for 'giy' at line 59
         Complex loop carried dependence of 'randnums' prevents parallelization
     67, Loop is parallelizable
         Accelerator kernel generated
         67, !$acc do parallel, vector(256)
              CC 1.3 : 4 registers; 24 shared, 68 constant, 0 local memory bytes; 100 occupancy
     77, Generating compute capability 1.3 binary
     78, Loop is parallelizable
         Accelerator kernel generated
         78, !$acc do parallel, vector(256)
              CC 1.3 : 5 registers; 20 shared, 56 constant, 0 local memory bytes; 100 occupancy


It says that line 38 (the do loop after !$acc do kernel in my code) is executed on the host, while line 49 (the do loop right after) is on the device. If the array 'seeds' is local to the device (because of the data region directive), does the value of 'seeds' from the host get carried over to the device?
mkcolg



Location: The Portland Group Inc.

Posted: Thu Jul 29, 2010 12:54 pm

Hi WmBruce,

The GPU's and the CPU's memories are distinct from each other. Any update you make to the GPU copy does not automatically get propagated to the host. The "updateout" and/or "updatein" directives should be used to synchronize the host and GPU memory.

For example, when you print out 'randnums', you're printing the host's copy of the array, which is all zeros. Adding a "!$acc updateout(randnums)" before the print loop will update the host values.
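In your program that would look something like this (only the update directive is new; the print loop is copied from your post):
Code:

   !$acc updateout(randnums)   ! bring the GPU's values back into the host copy
   do i = 1,20
      print *,randnums(i)
   enddo
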

Similarly, the following loop is operating on the host's 'particles' array, which needs its values updated to match the GPU's copy:
Code:

         !$acc updateout(particles)
         do jj = 1,numparticles
            if (particles(jj) .eq. 1) count(j) = count(j) + 1
         enddo


There are some other issues that I see. The first is with the loop that initializes your 'seeds'. This loop has loop-carried dependencies on the 'ix' and 'iy' variables, which prevent it from being parallelized; instead it is scheduled on the host. This means only your host-side copy of 'seeds' is being initialized.
Quote:
38, Loop carried scalar dependence for 'ix' at line 39
Loop carried scalar dependence for 'iy' at line 42
Loop carried scalar dependence for 'iy' at line 43
Inner sequential loop scheduled on host
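Concretely, the fix used in the full listing further down is to leave that loop on the host and then push the host values to the GPU's copy before any kernel reads them:
Code:

   ! seeds was just filled in by the sequential loop running on the host;
   ! copy those values into the GPU's local copy
   !$acc updatein(seeds)
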


Second, you linearize the 'randnums' array and use a computed index. The compiler can't tell whether all of the computed indices are unique, so it won't parallelize this loop. You can use the 'independent' clause, but it would be preferable to just make 'randnums' a multi-dimensional array.
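For example, roughly the change made in the full listing below:
Code:

   ! allocate randnums as a matrix rather than a flattened vector
   allocate(randnums(numseeds,numseeds))

   ! ... and index it with (j,jj); every index pair is trivially unique,
   ! unlike the computed index (j - 1) * numseeds + jj
   randnums(j,jj) = am*ior(iand(IM,ieor(gix,giy)),1)
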

Finally, 'count' is never used on the GPU, so it should be removed from the 'data region' directive. Otherwise, the GPU's copy (which is garbage) will overwrite the host's copy at the end of the data region.

Here's the full source with my changes. Of course, please double check that everything is correct.

- Mat

Code:
program main

!=====( Initialization )===========================!
    use accel_lib

    real, dimension(:), allocatable :: seeds
    real, dimension(:,:), allocatable :: randnums
    double precision, dimension(:), allocatable :: particles
    real, parameter :: PI = 4 * atan(1.0)
    integer :: numparticles, numseeds, seedinit, count(200)
    integer, parameter :: K4B=selected_int_kind(9)
   integer(K4B), parameter :: IA=16807,IM=2147483647,IQ=127773,IR=2836
   real :: am
   integer(K4B) :: gix,giy,gk,ix,iy,k

    write(*,*) "How many particles?"
    read(*,*) numparticles
    write(*,*) "Initial seed?"
    read(*,*) seedinit

    numseeds = int(sqrt(200.0 * numparticles)) + 1
    !future note: set some limit to the number of random numbers since the array might
    !not be able to hold all of it

    allocate(particles(numparticles))
   allocate(seeds(numseeds))
   allocate(randnums(numseeds,numseeds))

    call acc_init(acc_device_nvidia)

!=====( Generate Random Numbers )==================!
   !$acc data region local(particles,seeds,randnums)
      am = nearest(1.0,-1.0)/IM
      iy=ior(ieor(888889999,abs(seedinit)),1)
      ix=ieor(777755555,abs(seedinit))

       do j = 1, numseeds
         ix=ieor(ix,ishft(ix,13))
         ix=ieor(ix,ishft(ix,-17))
         ix=ieor(ix,ishft(ix,5))
         k=iy/IQ
         iy=IA*(iy-k*IQ)-IR*k
         if (iy < 0) iy=iy+IM
         seeds(j)=am*ior(iand(IM,ieor(ix,iy)),1)
      end do


   !$acc updatein(seeds)

   !$acc region
        do j = 1, numseeds
         giy=ior(ieor(888889999, int(1000 * seeds(j)) + 1), 1)
         gix=ieor(777755555, int(1000 * seeds(j)) + 1)

         do jj = 1, numseeds
            gix=ieor(gix,ishft(gix,13))
            gix=ieor(gix,ishft(gix,-17))
            gix=ieor(gix,ishft(gix,5))
            gk=giy/IQ
            giy=IA*(giy-gk*IQ)-IR*gk
            if (giy < 0) giy=giy+IM
            randnums(j,jj) = am*ior(iand(IM,ieor(gix,giy)),1)
         end do
        end do
   !$acc end region

!=====( Main )=================================!
   !$acc region
      !$acc do vector(256), parallel, independent
      do j = 1, numparticles
         particles(j) = 1
      end do
   !$acc end region

   !$acc updateout(randnums)
   do i = 1,numseeds
      do j = 1,numseeds
         print *,randnums(i,j)
      enddo
   enddo

   do j = 1,200
        !$acc region do
         do jj = 1, numparticles
            if (randnums(j,jj) .lt. 0.1) particles(jj) = 0
         end do

         !$acc updateout(particles)
         do jj = 1,numparticles
            if (particles(jj) .eq. 1) count(j) = count(j) + 1
         enddo

         if (count(j) .eq. 0) goto 100
      enddo


100      continue

   !$acc end data region

   open(unit = 2, file = 'data.txt')

    write(2,1000) 0, numparticles

    do i = 1, 200
      write(2,1000) i, count(i)
      if (count(i) .eq. 0) exit
   end do

   write(2,*) "c   Cycle Number   Number of particles"

1000   format (i5,i10)

end program
WmBruce




Posted: Thu Jul 29, 2010 3:05 pm

Does updatein/out use the same mechanism as a copyin/out clause on a region directive? In other words, is update faster, slower, or equally fast compared to copy?

Regarding my initialization of seeds, I do realize it has a loop-carried dependence. I placed a kernel clause on it hoping it would run sequentially on the device, but the compiler says "Inner sequential loop scheduled on host". Since I would like to avoid data copies back and forth, how do you force the compiler to run a do loop sequentially on the accelerator? And also, why is the compiler forcing all sequential loops onto the CPU? Is it because CPUs are faster at sequential work?

And last, what's the difference between the vector and parallel clauses for loop directives? And why is 256 a magic number for vector?
mkcolg



Location: The Portland Group Inc.

Posted: Thu Jul 29, 2010 4:04 pm

Quote:
Does updatein/out have the same process as a copyin/out directive of a region clause? In other words, is update faster, slower, or equally fast compared to copy?
Yes, it's the same process (so the cost per transfer is the same), except update gives you control over when the copies occur.
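As a quick sketch of the relationship (the program name and array are made up for illustration): a copy clause transfers data at the boundaries of the region or data region it appears on, while an update directive transfers data wherever you place it inside the data region:
Code:

program update_demo
   real :: a(1000)
   integer :: i
   a = 0.0

   !$acc data region copy(a)       ! host -> GPU copy happens here, once
      !$acc region
         do i = 1, 1000
            a(i) = i
         end do
      !$acc end region

      !$acc updateout(a)           ! GPU -> host copy, exactly when you ask for it
      print *, a(1000)             ! the host copy is now current
   !$acc end data region           ! the copy(a) clause copies back to the host here
end program update_demo
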

Quote:
Since I would like to avoid data copies back and forth, how do you force the compiler to run a do loop sequentially on the accelerator?
You need at least one level of parallelization in order for the compiler to create a kernel. You could add an outer loop with a trip count of 1 and possibly get the compiler to schedule this on the GPU, but I would advise against it. Each thread processor on the GPU is actually quite slow compared to the host CPU; the GPU's performance is gained through massive parallelization. So any gain you get by not copying the 'seeds' array would be taken away by the very slow kernel performance.

Granted, I can't say this is true for all cases, but the experiments I've done all show huge losses of performance when running a single sequential kernel on a GPU.
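For reference, the one-trip outer loop trick would look roughly like the sketch below (the 'dummy' loop variable is made up; the body is the seeds loop from the original post), but as noted above it generally isn't worth it:
Code:

   !$acc region
      !$acc do parallel            ! one level of (trivial) parallelism so a kernel can be built
      do dummy = 1, 1
         do j = 1, numseeds        ! still sequential because of the ix/iy dependences
            ix=ieor(ix,ishft(ix,13))
            ix=ieor(ix,ishft(ix,-17))
            ix=ieor(ix,ishft(ix,5))
            k=iy/IQ
            iy=IA*(iy-k*IQ)-IR*k
            if (iy < 0) iy=iy+IM
            seeds(j)=am*ior(iand(IM,ieor(ix,iy)),1)
         end do
      end do
   !$acc end region
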

Quote:
And also why is the compiler forcing all sequential loops into the cpu?
If you had a parallel outer loop, then the inner loop would be scheduled sequentially on the GPU. It's only because you don't have a parallel outer loop that it is scheduled on the host.
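For example (illustrative arrays 'a' and 'b', not from the program above):
Code:

   !$acc region
      !$acc do parallel, vector(256)
      do j = 1, n                        ! independent iterations: parallelized across GPU threads
         !$acc do seq
         do i = 2, m                     ! a(i,j) depends on a(i-1,j), so this loop runs
            a(i,j) = a(i-1,j) + b(i,j)   ! sequentially, but on the GPU, within each thread
         end do
      end do
   !$acc end region
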

Quote:
Is it because cpu's do sequential faster?
This has been my experience.

Quote:
And last, whats the difference between the vector and parallel clauses for loop directives?


The GPU contains N multiprocessors (MIMD), each containing M thread processors (SIMD). The "parallel" clause corresponds to the MIMD dimension and "vector" to the SIMD dimension. The values of N and M vary from card to card. The utility 'pgaccelinfo' will show how many are available on your card.
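A sketch of how the two clauses map onto that hardware (made-up arrays and sizes):
Code:

   !$acc region
      !$acc do parallel              ! outer iterations spread across the multiprocessors (MIMD)
      do j = 1, 1024
         !$acc do vector(256)        ! inner iterations run 256 at a time on a multiprocessor's
         do i = 1, 1024              ! thread processors (SIMD)
            c(i,j) = a(i,j) + b(i,j)
         end do
      end do
   !$acc end region
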

For full details, I would suggest reading Dr. Michael Wolfe's articles that can be found at http://www.pgroup.com/resources/accel.htm and http://www.pgroup.com/resources/articles.htm. This article is probably the best place to start: http://www.pgroup.com/lit/articles/insider/v2n1a5.htm

Quote:
and why is 256 a magic number for vector?
It's not. The vector size will vary per kernel, but will be a multiple of 16.

Hope this helps,
Mat
WmBruce




Posted: Tue Aug 03, 2010 9:55 am

Thanks for the reply