PGI User Forum

Time for copying array to device

shyboy_6104



Joined: 17 Oct 2011
Posts: 5

Posted: Wed Apr 18, 2012 6:30 pm    Post subject: Time for copying array to device

I am writing a program in which I want to copy a sub-array from the host to the device, but the performance is poor, so I tested the memory copy performance.
Code:

   module test   
      use cudafor
   contains
      attributes(global) subroutine cal_I(a,b)
          real :: a(90,90,90,3),b(90,90,90,3)
      
      end subroutine cal_I
   end module test
   
   program main
   use cudafor
   use test
         
   type ( cudaEvent ) :: startEvent , stopEvent, StepStartEvt, StepStopEvt
   real(4) :: time
   integer :: istat
   
   real :: a(100,100,100,3),b(100,100,100,3)
   real :: a1(90,90,90,3),b1(90,90,90,3)
   real, device :: dev_a1(90,90,90,3), dev_b1(90,90,90,3)
   integer :: i
   a=1.0
   b=2.0
      
   istat = cudaEventCreate ( startEvent )
   istat = cudaEventCreate ( stopEvent )
   
   istat = cudaEventRecord ( startEvent , 0)
   
   dev_a1(:,:,:,:) = a(1:90,1:90,1:90,:)
   dev_b1(:,:,:,:) = b(1:90,1:90,1:90,:)
   
   istat = cudaEventRecord ( stopEvent , 0)
   istat = cudaEventSynchronize ( stopEvent )
   istat = cudaEventElapsedTime (time , startEvent , stopEvent )
   write (* ,*) ' Time for copy1: ', time
   
   istat = cudaEventRecord ( startEvent , 0)
   do i=1,3
      dev_a1(:,:,:,i) = a(1:90,1:90,1:90,i)
      dev_b1(:,:,:,i) = b(1:90,1:90,1:90,i)
   enddo
   istat = cudaEventRecord ( stopEvent , 0)
   istat = cudaEventSynchronize ( stopEvent )
   istat = cudaEventElapsedTime (time , startEvent , stopEvent )
   write (* ,*) ' Time for copy2: ', time
   
   istat = cudaEventRecord ( startEvent , 0)
   dev_a1 = a1
   dev_b1 = b1
   istat = cudaEventRecord ( stopEvent , 0)
   istat = cudaEventSynchronize ( stopEvent )
   istat = cudaEventElapsedTime (time , startEvent , stopEvent )
   write (* ,*) ' Time for copy3: ', time
   
   istat = cudaEventRecord ( startEvent , 0)
   do i=1,3
      dev_a1(:,:,:,i) = a1(:,:,:,i)
      dev_b1(:,:,:,i) = b1(:,:,:,i)
   enddo
   istat = cudaEventRecord ( stopEvent , 0)
   istat = cudaEventSynchronize ( stopEvent )
   istat = cudaEventElapsedTime (time , startEvent , stopEvent )
   write (* ,*) ' Time for copy4: ', time
      
   ! The chevron syntax takes both a grid and a block size
   call cal_I<<<1,1>>>(dev_a1,dev_b1)

   end program main

Cases 1 and 2 copy sub-arrays of the larger arrays a and b: case 1 uses a single array assignment, while case 2 loops over the last dimension.
Cases 3 and 4 copy the whole arrays a1 and b1: case 3 uses a single array assignment, while case 4 loops over the last dimension. I ran the program on an M2050 and got the following results (cudaEventElapsedTime reports times in milliseconds).
Code:

 Time for copy1:     28.94928
 Time for copy2:     44.00723
 Time for copy3:     4.118048
 Time for copy4:     43.05971

According to the results, copying the entire array in a single assignment takes the least time. So should I first copy the sub-array into a host array of the same size, and then copy that entire array to the device?
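
To put the four timings on a common scale, the sketch below converts them to effective bandwidth. It is host-only standard Fortran and assumes the default real is 4 bytes (no -r8 or similar flag); the program name and the variables bytes and t_ms are illustrative and not part of the original program.

```fortran
program effective_bandwidth
   implicit none
   ! Bytes moved per test: two arrays of 90*90*90*3 default (4-byte) reals
   real(8), parameter :: bytes = 2.0d0 * 90 * 90 * 90 * 3 * 4
   ! Times in milliseconds, taken from the output above
   real(8), parameter :: t_ms(4) = [28.94928d0, 44.00723d0, 4.118048d0, 43.05971d0]
   integer :: i

   do i = 1, 4
      ! bytes / seconds, reported in GB/s (1 GB = 1e9 bytes)
      print '(a,i0,a,f6.2,a)', 'copy', i, ': ', &
            bytes / (t_ms(i) * 1.0d-3) / 1.0d9, ' GB/s'
   end do
end program effective_bandwidth
```

With these numbers, copy3 works out to a little over 4 GB/s, while the other three cases stay well under 1 GB/s.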
mkcolg



Joined: 30 Jun 2004
Posts: 5943
Location: The Portland Group Inc.

Posted: Thu Apr 19, 2012 9:34 am

Hi shyboy_6104,

There is a significant amount of overhead in copying data to and from the GPU. Hence, reducing the frequency of copies is important. However, since DMA transfers must be on contiguous data, copying sub-arrays often requires the compiler to create implicit DO loops that only copy small contiguous blocks.
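
To make the contiguity point concrete, here is a small host-only sketch in standard Fortran. It reasons only about Fortran's column-major storage of the section a(1:90,1:90,1:90,:) inside a(100,100,100,3); the thread does not show how the compiler actually splits the transfer, so the numbers describe the memory layout and not the generated code. The names n, m, k, run_len, and n_runs are illustrative.

```fortran
program contiguous_runs
   implicit none
   integer, parameter :: n = 100   ! extent of the full host array a(n,n,n,k)
   integer, parameter :: m = 90    ! extent of the section a(1:m,1:m,1:m,:)
   integer, parameter :: k = 3     ! size of the last dimension
   integer :: run_len, n_runs

   ! Column-major storage: only the first dimension of the section is
   ! contiguous, so a(1:m,j,l,c) is one run of m elements and the next
   ! run begins at least n - m elements after the previous one ends.
   run_len = m
   n_runs  = m * m * k

   print *, 'elements per contiguous run  :', run_len
   print *, 'gap after each run (at least):', n - m
   print *, 'number of separate runs      :', n_runs
   print *, 'bytes per run (4-byte real)  :', 4 * run_len
   print *, 'total bytes in the section   :', 4 * run_len * n_runs
end program contiguous_runs
```

For one array this gives 24300 separate runs of 360 bytes each, which is why a single contiguous block of the same total size is a much friendlier shape for a transfer.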

In the case of examples 1, 2, and 4, the compiler is most likely generating a triply nested DO loop. However, example 1 is copying three elements at a time, while 2 and 4 are copying one element at a time.

My recommendation when copying sub-arrays is to do something similar to what you do in example #3. Create a temp host array with the same size as the sub-array (i.e. a1 and b1), gather the sub-array into the temp array, and then copy the temp array to the device in one contiguous block. Use a similar scatter method when copying back to the host.

Though, if you have the device memory available, it may be just as fast to copy the entire original arrays and not worry about the gather and scatter operations.
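
The gather, copy, and scatter steps described above could be sketched roughly as follows in CUDA Fortran. This is a minimal illustration, not a tuned implementation: the program name and the names tmp and dev_a are illustrative, the arrays are made allocatable only to keep them off the stack, and the kernel work is left as a placeholder comment.

```fortran
program gather_copy_scatter
   use cudafor
   implicit none
   integer, parameter :: n = 100, m = 90
   real, allocatable :: a(:,:,:,:)              ! full host array
   real, allocatable :: tmp(:,:,:,:)            ! contiguous host staging buffer
   real, device, allocatable :: dev_a(:,:,:,:)  ! device copy of the sub-array

   allocate(a(n,n,n,3), tmp(m,m,m,3), dev_a(m,m,m,3))
   a = 1.0

   ! Gather: host-to-host copy of the strided section into the buffer
   tmp = a(1:m,1:m,1:m,:)
   ! Whole-array assignment: one contiguous host-to-device transfer
   dev_a = tmp

   ! ... launch kernels that work on dev_a here ...

   ! Whole-array assignment: one contiguous device-to-host transfer
   tmp = dev_a
   ! Scatter: write the results back into the section of the full array
   a(1:m,1:m,1:m,:) = tmp

   deallocate(a, tmp, dev_a)
end program gather_copy_scatter
```

The gather and scatter run entirely on the host, so their cost is a host memory copy rather than many small transfers across the bus.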

Hope this helps,
Mat
shyboy_6104



Joined: 17 Oct 2011
Posts: 5

Posted: Mon Apr 23, 2012 7:37 pm

Hi, Mat.
The recommendation is helpful. Thanks.
In my program, I need to copy irregular arrays to device memory, so I need a loop to do this.
I found another method, which uses pointers.
Code:

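   ! Assumes: use iso_c_binding; type(c_devptr) :: cdvx; type(c_ptr) :: chvx
   ! a1 and b1 need the TARGET attribute so that c_loc can be applied to them.
   ! With pointer arguments the count is in bytes (4 bytes per default real).
   do i=1,3
      cdvx = c_devloc(dev_a1(1,1,1,i))
      chvx = c_loc(a1(1,1,1,i))
      istat = cudaMemcpy(cdvx, chvx, 90*90*90*4, cudaMemcpyHostToDevice)
      cdvx = c_devloc(dev_b1(1,1,1,i))
      chvx = c_loc(b1(1,1,1,i))
      istat = cudaMemcpy(cdvx, chvx, 90*90*90*4, cudaMemcpyHostToDevice)
   enddo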

Compared with example #3, the new method adds only a little overhead.
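
A possible variation on the loop above, sketched under the assumption that the cudafor generic interface for cudaMemcpy also accepts typed host and device arguments with the count given in elements rather than bytes (worth checking against your compiler's documentation). It passes the first element of each slab directly, so no c_ptr or c_devptr variables are needed. The program name and the parameter m are illustrative.

```fortran
program slab_memcpy
   use cudafor
   implicit none
   integer, parameter :: m = 90
   real, allocatable :: a1(:,:,:,:)
   real, device, allocatable :: dev_a1(:,:,:,:)
   integer :: i, istat

   allocate(a1(m,m,m,3), dev_a1(m,m,m,3))
   a1 = 1.0

   do i = 1, 3
      ! Each slab (:,:,:,i) is contiguous; copy m*m*m elements starting
      ! at its first element. The count here is in elements, not bytes.
      istat = cudaMemcpy(dev_a1(1,1,1,i), a1(1,1,1,i), m*m*m, &
                         cudaMemcpyHostToDevice)
      if (istat /= cudaSuccess) then
         print *, 'cudaMemcpy failed: ', cudaGetErrorString(istat)
      end if
   end do

   deallocate(a1, dev_a1)
end program slab_memcpy
```

Checking istat after each call is cheap and makes a wrong count or a bad pointer show up immediately instead of as corrupted data later.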
All times are GMT - 7 Hours