PGI User Forum


Sparse Matrix Vector Multiplication

 
PGI User Forum Forum Index -> Accelerator Programming
elliotmh



Joined: 22 Oct 2010
Posts: 1

Posted: Fri Oct 29, 2010 6:42 pm    Post subject: Sparse Matrix Vector Multiplication

Hi,

I'm new to GPU programming but am very interested in speeding up our conjugate gradient algorithm. With a simple preconditioner, the computational bottleneck should be the sparse matrix-vector multiplication, so I've been experimenting with getting the CRS matrix-vector products efficiently implemented on the GPU. When I use the following code I find that one core of the CPU is almost 5 times faster than the GPU (i7 960 vs. GTS 450, with about 1 million nonzeros in A). I was hoping that by performing many SpMV operations with the same matrix on different vectors I would see a significant speedup. Any suggestions would be greatly appreciated.

Thanks

nitt=1000

!$acc data region local(t,y)
do j=1,nitt ! iterations

!$acc region
do i = 1, nrow ! SpMV products
   t=0
   do k = ia(i), ia(i+1)-1
      t=t+a(k)*x(ja(k))
   enddo
   y(i)=t
enddo
x=y ! To test, set x to be something new each iteration

!$acc end region
enddo
!$acc end data region


Here is the output from compiling the code. It has chosen to vectorize the loop over the rows.

pgfortran -o f1.exe f1.f90 -ta=nvidia -Minfo=accel -fast
NOTE: your trial license will expire in 7 days, 5.63 hours.
main:
47, Generating local(y(:))
Generating local(t)
49, Generating copyin(ja(:))
Generating copyin(x(:))
Generating copyout(x(1:111872))
Generating copyin(a(:))
Generating copyin(ia(1:111873))
Generating compute capability 1.3 binary
50, Loop is parallelizable
Accelerator kernel generated
50, !$acc do parallel, vector(256)
Cached references to size [257] block of 'ia'
CC 1.3 : 16 registers; 1052 shared, 132 constant, 0 local memory bytes; 100% occupancy
52, Loop is parallelizable
57, Loop is parallelizable
Accelerator kernel generated
57, !$acc do parallel, vector(256)
CC 1.3 : 4 registers; 20 shared, 112 constant, 0 local memory bytes; 100% occupancy
mkcolg



Joined: 30 Jun 2004
Posts: 6211
Location: The Portland Group Inc.

Posted: Mon Nov 01, 2010 12:38 pm

Hi elliotmh,

One thing that will help is to add "copyin(ia,ja,a)" and "copy(x)" clauses to your data region. This should reduce the amount of data copied: right now you're copying "a", "ja", and "ia" in, and copying "x" in and out, 1000 times.

The compiler is able to recognize reductions and generate an optimized parallel reduction kernel. However, since one GPU kernel can't be launched from within another, the parallel reduction can't be generated here. This is why only the i loop is being accelerated.

A secondary issue is that loops must be rectangular in shape in order to be accelerated (i.e. the loop bounds must be known before entering a region). Since the k-loop's bounds are determined within the body, this loop cannot be accelerated.

Another issue is your memory accesses of "a", "ja" and "x". Ideally you want to access memory using the vector loop index (which is "i" in this case); otherwise you get memory divergence (see: http://www.pgroup.com/lit/articles/insider/v2n1a5.htm).

Finally, your compute intensity is very low. Compute intensity is the ratio of computation to data movement. With only one multiply and one add per nonzero, and no data reuse, the code may not be suitable for a GPU.

How big is nrow (the i-loop trip count) and the average size of ia(i) to ia(i+1) (the k-loop trip count)?

If the i-loop is large and the k-loop is small, keep the code as is and just fix the memory copies in the data region.

If the i-loop is small and the k-loop is large, move the acc region around the k-loop only. This will help with memory divergence and may give you enough parallelism to overcome the low compute intensity.

If both the i- and k-loops are large, try recoding to something like the following:
Code:

real, dimension(:,:), allocatable :: temp
allocate(temp(nrow,maxval))
....
nitt=1000

!$acc data region local(y,temp), copy(x), copyin(ia,ja,a)
do j=1,nitt ! iterations

!$acc region
temp=0
do i = 1, nrow ! SpMV products
   do k = 1, maxval
      if (k .ge. ia(i) .and. k .lt. ia(i+1)) then
         temp(i,k)=a(k)*x(ja(k))
      endif
   enddo
enddo

do i = 1, nrow ! sum each row
   y(i)=0
   do k = 1, maxval
      y(i) = y(i) + temp(i,k)
   enddo
enddo

x=y ! To test, set x to be something new each iteration

!$acc end region
enddo
!$acc end data region


Hope this helps,
Mat