View previous topic :: View next topic 
Author 
Message 
elliotmh
Joined: 22 Oct 2010 Posts: 1

Posted: Fri Oct 29, 2010 6:42 pm Post subject: Sparse Matrix Vector Multiplication 


Hi,
I'm new to gpu programming but am very interested in speeding up our conjugate gradient algorithm. With a simple preconditioner the computational bottleneck should be the sparse matrix vector multiplication. I've been experimenting in order to get the CRS matrix vector products efficiently implemented on the gpu. When I use the following code I find that one core on the cpu is almost 5 times faster then on the gpu (i7 960 vs gts 450 with about 1 million nnz in A). I was hoping that when I perform many spMV operations with the same matrix on different vectors I could see a significant speedup. Any suggestions would be greatly appreciated.
Thanks
nitt=1000
!$acc data region local(t,y)
do j=1,nitt ! itterations
!$acc region
do i = 1, nrow ! SpMV products
t=0
do k = ia(i), ia(i+1)1
t=t+a(k)*x(ja(k))
enddo
y(i)=t
enddo
x=y ! To test set x to be something new each iteration
!$acc end region
enddo
!$acc end data region
Here is the output from compiling the code. It has chosen to vectorize the loop over the rows.
pgfortran o f1.exe f1.f90 ta=nvidia Minfo=accel fast
NOTE: your trial license will expire in 7 days, 5.63 hours.
NOTE: your trial license will expire in 7 days, 5.63 hours.
main:
47, Generating local(y(:))
Generating local(t)
49, Generating copyin(ja(:))
Generating copyin(x(:))
Generating copyout(x(1:111872))
Generating copyin(a(:))
Generating copyin(ia(1:111873))
Generating compute capability 1.3 binary
50, Loop is parallelizable
Accelerator kernel generated
50, !$acc do parallel, vector(256)
Cached references to size [257] block of 'ia'
CC 1.3 : 16 registers; 1052 shared, 132 constant, 0 local memory by
tes; 100 occupancy
52, Loop is parallelizable
57, Loop is parallelizable
Accelerator kernel generated
57, !$acc do parallel, vector(256)
CC 1.3 : 4 registers; 20 shared, 112 constant, 0 local memory bytes
; 100 occupancy 

Back to top 


mkcolg
Joined: 30 Jun 2004 Posts: 6492 Location: The Portland Group Inc.

Posted: Mon Nov 01, 2010 12:38 pm Post subject: 


Hi elliotmh,
One thing that will help would be to add a "copyin(ia,ja,a)" and "copy(x)" to your data region. This should help reduce the amount of data copied. Right now you're copying "a", "ja", and "ia" in and copying "x" in and out 1000 times.
The compiler is able to recognize reductions and generate an optimized parallel reduction kernel. However, since one GPU kernel can't be launched from within another, the parallel reduction can't be generated here. This is why only the i loop is being accelerated.
A secondary issue is that loops must be rectangular in shape in order to be accelerated (i.e. the loop bounds must be know before entering a region). Since the kloop's bounds is determined within the body, this loop is not able to be accelerated.
Another issue is your memory accesses of "a", "ja" and "x". Ideally you want to access memory using the vector loop index (which is "i" in this case). Otherwise you get memory divergence (See: http://www.pgroup.com/lit/articles/insider/v2n1a5.htm)
Finally, your compute intensity is very low. Compute intensity is the ratio of computation over data movement. With only one multiple and and one add and no data reuse, the code may not be suitable for a GPU.
How big is nrow (the iloop trip count) and the average size of ia(i) to ia(i+1) (the kloop trip count)?
If the iloop is large and kloop is small, then keep the code as is and just fix the memory copys in the data region.
if the iloop is small and kloop is large, move the acc region around the kloop only. This will help with memory divergence and may give you enough parallelization to overcome the low compute intensity.
If both the i and j loops are large, try recoding to some thing like the following:
Code: 
real, dimension(:,:), allocatable :: temp
allocate(temp(nrow,maxval))
....
nitt=1000
!$acc data region local(t,y),copy(x), copyin(ia,ja,a)
do j=1,nitt ! itterations
!$acc region
temp=0
do i = 1, nrow ! SpMV products
do k = 1, maxval
if (k .ge. ia(i) .and. k .lt. ia(i)) then
temp(i,j)=a(k)*x(ja(k))
endif
enddo
enddo
do i = 1, nrow ! SpMV products
y(i)=0
do k = 1, maxval
y(i) = y(i) + temp(i,j)
enddo
enddo
x=y ! To test set x to be something new each iteration
!$acc end region
enddo
!$acc end data region 
Hope this helps,
Mat 

Back to top 




You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum You cannot vote in polls in this forum

Powered by phpBB © phpBB Group
