PGI User Forum


Non-stride-1 accesses for array 'X'
 
PGI User Forum Forum Index -> Accelerator Programming
AGy



Joined: 06 Jul 2010
Posts: 8

Posted: Mon Jul 12, 2010 11:58 pm    Post subject: Non-stride-1 accesses for array 'X'

Hi,

I found some information about this message, but I have not fully understood what it means and what it involves in the calculation process.

Does it cost time? Can we get around it?

Could someone give me more details on this message than what is in the PGI User Guide and in the examples on this site?

Thanks in advance.
Have a nice day.
mkcolg



Joined: 30 Jun 2004
Posts: 6134
Location: The Portland Group Inc.

Posted: Tue Jul 13, 2010 11:45 am

Hi AGy,

'Non-stride-1 accesses' means that the GPU threads won't be accessing contiguous data in your array. Although sometimes unavoidable, this can cause performance problems, so you should review your code to determine whether it can be restructured or whether a different GPU schedule will help.

The details are a bit lengthy, but I'll try to be brief. An NVIDIA GPU is composed of several multiprocessors (MIMD). Each multiprocessor is in turn composed of several thread processors (SIMD). My Tesla has 30 multiprocessors, each with 8 thread processors, for a total of 240. This varies from card to card; for details about your card, please run the utility 'pgaccelinfo'.

SIMD stands for 'Single Instruction, Multiple Data', which means that all the threads running on the same multiprocessor must execute the same instruction at the same time, although each performs the instruction on different data. Note that a group of threads run together on a single multiprocessor is called a 'warp'.

So what happens when all the threads in a warp access memory? If the memory is contiguous, the hardware is optimized so that the threads can all bring in their data at the same time (the accesses are coalesced). If the memory is not contiguous, only one thread can access memory at a time while the other threads wait.
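To make the contiguity point concrete, here is a small sketch in plain Python (the helper name is made up; this only models the linear addresses of a column-major array, not actual GPU hardware):

```python
# A sketch (plain Python, hypothetical helper name) of why varying the
# first index of a column-major array gives stride-1 (coalesced) access
# while varying the second index gives stride-n access.

def col_major_addr(i, j, n):
    """Linear offset of element (i, j) in a column-major n x n array (0-based)."""
    return i + j * n

n = 8

# "Threads" 0..3 each read element (tid, 5): consecutive addresses.
stride1 = [col_major_addr(tid, 5, n) for tid in range(4)]
print(stride1)   # [40, 41, 42, 43] -> one coalesced transaction

# The same threads each read element (5, tid): addresses n apart.
strided = [col_major_addr(5, tid, n) for tid in range(4)]
print(strided)   # [5, 13, 21, 29] -> serialized accesses
```

When the addresses a warp touches differ by 1, the hardware can service them together; when they differ by n, each access is a separate transaction.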

Hope this helps,
Mat
AGy



Joined: 06 Jul 2010
Posts: 8

Posted: Wed Jul 21, 2010 5:51 am

Hi Mat,

Thank you for your last answer, it was really helpful. Nevertheless, I need an example.

Could you please tell me whether, in the enclosed short example, it is possible to avoid the "Non-stride-1 accesses for array 'X'" message?

And then, why is it unavoidable, or is it actually avoidable?

Indeed, I assume my arrays are properly allocated and contiguous, which is why I am surprised to see this message:


Code:

pgf95 -Minfo=all -ta=nvidia main_acc.f90
main:
     36, Generating copy(c(:,:))
         Generating copy(b(:,:))
         Generating copy(a(:,:))
     37, Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     38, Loop is parallelizable
     41, Loop is parallelizable
     43, Loop is parallelizable
         Accelerator kernel generated
         38, !$acc do seq
         41, !$acc do vector(32)
         43, !$acc do parallel
             Non-stride-1 accesses for array 'c'
             Non-stride-1 accesses for array 'b'
             Non-stride-1 accesses for array 'a'
              CC 1.0 : 13 registers; 20 shared, 80 constant, 0 local memory bytes; 33% occupancy
              CC 1.3 : 13 registers; 20 shared, 80 constant, 0 local memory bytes; 25% occupancy


The main_acc.f90 program is the following :

Code:

   PROGRAM main

   USE accel_lib

   IMPLICIT NONE

   INTEGER :: DEVICE_TYPE, nbr_DEVICES, i, j, k, n, v
   REAL :: temps_elapsed
   REAL, ALLOCATABLE, DIMENSION(:,:) :: A,B,C,D
   REAL, ALLOCATABLE, DIMENSION(:) :: KK
   REAL :: t1, t2
   INTEGER :: tt1, tt2, tts, tt_max, tinit

   DEVICE_TYPE = acc_device_nvidia

   call acc_init(DEVICE_TYPE)
   call acc_set_device(DEVICE_TYPE)
!   call acc_set_device_num(1,DEVICE_TYPE)

!   nbr_DEVICES = acc_get_device_num(DEVICE_TYPE)

!   WRITE(*,*) "DEVICES : ",nbr_DEVICES

   CALL SYSTEM_CLOCK(COUNT_RATE=tts, COUNT_MAX=tt_max)

   n=12000

   ALLOCATE(A(n,n))
   ALLOCATE(B(n,n))
   ALLOCATE(C(n,n))

   ! ================================================
   call cpu_time(t1)
   CALL SYSTEM_CLOCK(COUNT=tt1)
   
   !$acc data region copy(A,B,C)
   !$acc region do host(32)         ! 20 ms
   DO i=1,n

      !$acc do vector(32)
      DO j=1,100,n
         !$acc do parallel
         DO v=0,99
            A(i,j+v)=(i+j+v)/(i+j+v)
            B(i,j+v)=(i+j+v)/(i+j+v)
            C(i,j+v)=(i+j+v)/(i+j+v)
         ENDDO
      ENDDO

   ENDDO
   !$acc end data region

   CALL SYSTEM_CLOCK(COUNT=tt2)
   call cpu_time(t2)

   WRITE(*,*) 'cpu acc \t: ',(t2-t1)*1000.0,' ms'
   WRITE(*,*) 'clock acc \t: ',REAL(tt2-tt1)*1000.0/tts,' ms'
   WRITE(*,*) 'rapport \t: ',(t2-t1)*tts/(tt2-tt1)

   DEALLOCATE(A)
   DEALLOCATE(B)
   DEALLOCATE(C)

   END PROGRAM main


Please note that my graphics card is the following:

Code:

pgaccelinfo
CUDA Driver Version:           3000

Device Number:                 0
Device Name:                   Tesla C2050
Device Revision Number:        2.0
Global Memory Size:            2817720320
Number of Multiprocessors:     14
Number of Cores:               448
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           32768
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       65535 x 65535 x 1
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1147 MHz
Initialization time:           1018 microseconds
Current free memory:           2719612928
Upload time (4MB):             1595 microseconds ( 973 ms pinned)
Download time:                 3086 microseconds (1460 ms pinned)
Upload bandwidth:              2629 MB/sec (4310 MB/sec pinned)
Download bandwidth:            1359 MB/sec (2872 MB/sec pinned)


Thanks in advance.
Have a nice day.

AGy
mkcolg



Joined: 30 Jun 2004
Posts: 6134
Location: The Portland Group Inc.

Posted: Wed Jul 21, 2010 10:37 am

Hi AGy,

Quote:

!$acc region do host(32) ! 20 ms
DO i=1,n

Fortran is column major, meaning that column data is stored contiguously in memory. Hence, your 'i' loop needs to be the vector loop (aka SIMD, thread block, warp), yet you have 'i' scheduled on the host.

The simple fix is to remove the 'host' clause. The compiler will then automatically schedule 'i' as the vector loop (see below).

If your real code does need to schedule the outer loop on the host (it isn't necessary for this example), then you would need to reorganize your data or interchange your loops so that the array's first dimension is the vector and the second dimension is scheduled on the host. This is the opposite of scheduling on a CPU using OpenMP.
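The column-major point can be sketched in plain Python (hypothetical names; this models only the linear offsets, not the accelerator schedule): the loop that varies the first index fastest is the one that walks memory with stride 1, which is why it should become the vector loop.

```python
# A sketch (plain Python, hypothetical names) of which loop order walks
# a column-major n x n array contiguously; it only models linear offsets.

def visit_order(outer, n):
    """Offsets visited in a column-major n x n array for a given outer loop index."""
    if outer == "i":   # i outer, j inner: jumps of n between consecutive visits
        return [i + j * n for i in range(n) for j in range(n)]
    else:              # j outer, i inner: stride-1 walk through memory
        return [i + j * n for j in range(n) for i in range(n)]

n = 4
print(visit_order("i", n)[:6])  # [0, 4, 8, 12, 1, 5] -> strided
print(visit_order("j", n)[:6])  # [0, 1, 2, 3, 4, 5]  -> contiguous
```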

Hope this helps,
Mat


Code:
% cat stride.f90
   PROGRAM main

   USE accel_lib

   IMPLICIT NONE

   INTEGER :: DEVICE_TYPE, nbr_DEVICES, i, j, k, n, v
   REAL :: temps_elapsed
   REAL, ALLOCATABLE, DIMENSION(:,:) :: A,B,C,D
   REAL, ALLOCATABLE, DIMENSION(:) :: KK
   REAL :: t1, t2
   INTEGER :: tt1, tt2, tts, tt_max, tinit

   DEVICE_TYPE = acc_device_nvidia

   call acc_init(DEVICE_TYPE)
   call acc_set_device(DEVICE_TYPE)
!   call acc_set_device_num(1,DEVICE_TYPE)

!   nbr_DEVICES = acc_get_device_num(DEVICE_TYPE)

!   WRITE(*,*) "DEVICES : ",nbr_DEVICES

   CALL SYSTEM_CLOCK(COUNT_RATE=tts, COUNT_MAX=tt_max)

   n=12000

   ALLOCATE(A(n,n))
   ALLOCATE(B(n,n))
   ALLOCATE(C(n,n))

   ! ================================================
   call cpu_time(t1)
   CALL SYSTEM_CLOCK(COUNT=tt1)

 !$acc data region copy(A,B,C)
 !$acc region do
   DO i=1,n
      DO j=1,100,n
         DO v=0,99
            A(i,j+v)=(i+j+v)/(i+j+v)
            B(i,j+v)=(i+j+v)/(i+j+v)
            C(i,j+v)=(i+j+v)/(i+j+v)
         ENDDO
      ENDDO

   ENDDO
!$acc end region
!$acc end data region

   CALL SYSTEM_CLOCK(COUNT=tt2)
   call cpu_time(t2)

   WRITE(*,*) 'cpu acc \t: ',(t2-t1)*1000.0,' ms'
   WRITE(*,*) 'clock acc \t: ',REAL(tt2-tt1)*1000.0/tts,' ms'
   WRITE(*,*) 'rapport \t: ',(t2-t1)*tts/(tt2-tt1)

   DEALLOCATE(A)
   DEALLOCATE(B)
   DEALLOCATE(C)

   END PROGRAM main
% pgf90 -ta=nvidia -Minfo=accel stride.f90 -V10.6
main:
     36, Generating copy(c(:,:))
         Generating copy(b(:,:))
         Generating copy(a(:,:))
     37, Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     38, Loop is parallelizable
     39, Loop is parallelizable
     40, Loop is parallelizable
         Accelerator kernel generated
         38, !$acc do parallel, vector(16)
         39, !$acc do parallel
         40, !$acc do vector(16)
              CC 1.0 : 17 registers; 24 shared, 84 constant, 0 local memory bytes; 33% occupancy
              CC 1.3 : 17 registers; 24 shared, 84 constant, 0 local memory bytes; 75% occupancy
WmBruce



Joined: 18 May 2009
Posts: 14

PostPosted: Thu Jul 29, 2010 2:50 pm    Post subject: Reply with quote

The description of SIMD, warps, etc. was really helpful, but I have a few questions. Does PGI allow simultaneously running different instructions on different multi-processors? And if yes, how?