PGI User Forum


Implicit async memcpy (updated with an example)

 
pilot117



Joined: 25 Jun 2012
Posts: 8

Posted: Tue Jul 10, 2012 10:10 am    Post subject: Implicit async memcpy (updated with an example)

Hi,

In my Fortran code, if I use only OpenACC, with "mirror" to share global device arrays across several subroutines, then any host-to-device update done with "update" WITHOUT "async" generates only one stream.

However, when I pass the "-Mcuda" flag to the compiler and do not use "update", there is still one stream. But when I use "update" (even without "async"), two streams are launched, and one of them is responsible for the memory transfer.

Why is that? How can I force a single stream while still using -Mcuda?

P.S.: In my application, this asynchronous memory transfer cannot reduce the run time, since no computation and memcpy are overlapped. Moreover, the two streams introduce overhead that is not negligible compared with the pure computation.



-----------------------------updated with an example here---------------------
Below is a Fortran 77 sample code called simpleTest.f. My compile command line is
Code:
pgf90 -o test -acc -mp -ta=nvidia:cc2.0,time -Minfo=accel -Mcuda -Mvect simpleTest.f




Code:

      subroutine mysub()
      integer N,i
      parameter (N=1048576)
      common/blk/t1(N),t2(N),t3(N)
!$acc update device(t1,t2,t3)
!$acc kernels
!$acc loop
      do i=1,N
         t2(i)=t1(i)*t1(i)+t2(i)*t3(N)
      enddo
!$acc end kernels
!$acc update host(t2)
      return
      end

      program mainTest
      integer N,i
      parameter (N=1048576)
      real t1(1:N),t2(1:N),t3(1:N)
      common/blk/t1,t2,t3

!$acc mirror(t1,t2,t3)

      do i=1,N
         CALL RANDOM_NUMBER(HARVEST=X)
         t1(i)=X
         CALL RANDOM_NUMBER(HARVEST=X)
         t2(i)=X
         CALL RANDOM_NUMBER(HARVEST=X)
         t3(i)=X
      enddo
 

      do j=1,10
      call mysub()
      end do

      end program mainTest



When you run the executable "test" under nvvp, you will see two streams (the profiler timeline image is not preserved here).


However, if I change the assignment in the loop to
Code:
t2(i)=t1(i)*t1(i)+t2(i)*t3(i)

It uses one stream again.





Thanks!
mkcolg



Joined: 30 Jun 2004
Posts: 6134
Location: The Portland Group Inc.

Posted: Thu Jul 12, 2012 1:32 pm    Post subject: (none)

Hi pilot117,

Sorry for the delayed response. I needed to ask one of our compiler engineers about this. It turns out to be the same problem as one we found internally a few weeks ago with stream assignment and asynchronous data movement. The correct behaviour here is to use a single stream. We will have this corrected in our next release.

Thanks!
Mat
pilot117



Joined: 25 Jun 2012
Posts: 8

Posted: Wed Aug 08, 2012 3:54 pm    Post subject: (none)

Hi, Mat,

I tried this simple example with the 12.6 compiler. It still runs two streams as before, and the profiling timeline is similar to the one shown above.

Any ideas?

Thanks!

All times are GMT - 7 Hours