PGI User Forum


Shared memory approach with OpenACC

 
lechat

Joined: 14 Oct 2012
Posts: 10

Posted: Tue Oct 16, 2012 9:46 pm    Post subject: Shared memory approach with OpenACC

Hi,

I'm trying to use shared memory to cache data with OpenACC.

Basically what I'm working on is a matrix multiplication, and what I have is this:

Code:
typedef float ff;

// Multiplies two square row-major matrices a and b, puts the result in c.
void mmul(const restrict ff* a,
          const restrict ff* b,
          restrict ff* c,
          const int n) {
#pragma acc data copyin(a[0:n*n], b[0:n*n]) copy(c[0:n*n])
{

#pragma acc region
{

#pragma acc loop independent vector(16)
  for (int i = 0; i < n; ++i) {
#pragma acc loop independent vector(16)
    for (int j = 0; j < n; ++j) {
      ff sum = 0;
      for (int k = 0; k < n; ++k) {
        sum += a[i + n * k] * b[k + n * j];
      }
      c[i + n * j] = sum;
    }
  }

}
}
}


What I would like to do is use shared memory to cache tiles of the matrices 'a' and 'b' to use in the computation of 'c', in a similar fashion to what the CUDA mmul algorithm does.

I understand I can use the

Code:
#pragma acc cache


directive, but I'm having some trouble understanding how that's going to be mapped to the CUDA architecture.

Basically on CUDA I would know the exact size of my blocks, and would be able to (see the sketch below):
    declare a shared memory array with the size of the block
    copy the 'relevant' part of the data into it
    use this data in the computation
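
For reference, this is roughly the CUDA pattern I have in mind (a minimal sketch, assuming 'n' is a multiple of the tile size; untested):

Code:
#define TILE 16

// Classic tiled matrix multiply; element (i,j) lives at [i + n*j],
// matching the indexing in the OpenACC version above.
__global__ void mmul_tiled(const float* a, const float* b,
                           float* c, int n) {
  __shared__ float as[TILE][TILE];  // tile of 'a' staged in shared memory
  __shared__ float bs[TILE][TILE];  // tile of 'b' staged in shared memory
  int i = blockIdx.x * TILE + threadIdx.x;
  int j = blockIdx.y * TILE + threadIdx.y;
  float sum = 0.0f;
  for (int t = 0; t < n; t += TILE) {
    // Each thread copies one element of the 'relevant' tiles.
    as[threadIdx.y][threadIdx.x] = a[i + n * (t + threadIdx.y)];
    bs[threadIdx.y][threadIdx.x] = b[(t + threadIdx.x) + n * j];
    __syncthreads();
    for (int k = 0; k < TILE; ++k)
      sum += as[k][threadIdx.x] * bs[threadIdx.y][k];
    __syncthreads();
  }
  c[i + n * j] = sum;
}

// Launched as, e.g.:
//   dim3 block(TILE, TILE), grid(n / TILE, n / TILE);
//   mmul_tiled<<<grid, block>>>(d_a, d_b, d_c, n);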


Is there a way to achieve something similar with OpenACC? Is there a good tutorial/resource on the use of the cached directive or on how to map some of the power of shared memory from CUDA to OpenACC?

Thanks
mkcolg

Joined: 30 Jun 2004
Posts: 5952
Location: The Portland Group Inc.

Posted: Wed Oct 17, 2012 2:45 pm

Hi lechat,

This will be tough with this code. The cache directive only works on contiguous data segments, so you could only use it on 'b', but caching 'b' won't help. What you want here is for the reused portions of 'a' to be coalesced into shared memory, but unfortunately the compiler doesn't do this. You might be able to block the code, but I'm not sure it would help much.
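
For reference, the cache directive goes at the top of a loop body and names a contiguous sub-array. A minimal sketch of the syntax with your data layout, where a column slice of 'b' is the contiguous piece (again, don't expect it to help much here):

Code:
typedef float ff;

// Sketch only: the cache directive at the top of the j-loop body asks
// the compiler to stage the contiguous column slice of 'b' that the
// k loop reuses.
void mmul_cache(const ff* restrict a, const ff* restrict b,
                ff* restrict c, const int n) {
#pragma acc kernels copyin(a[0:n*n], b[0:n*n]) copyout(c[0:n*n])
  {
#pragma acc loop independent
    for (int j = 0; j < n; ++j) {
#pragma acc cache(b[n*j:n])      // contiguous slice b[n*j .. n*j+n-1]
#pragma acc loop independent
      for (int i = 0; i < n; ++i) {
        ff sum = 0;
        for (int k = 0; k < n; ++k)
          sum += a[i + n*k] * b[k + n*j];
        c[i + n*j] = sum;
      }
    }
  }
}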

Note that unless you're using a C1060, shared memory usage is not as critical; newer NVIDIA cards have hardware caching.

- Mat
lechat

Joined: 14 Oct 2012
Posts: 10

Posted: Wed Oct 17, 2012 7:24 pm

Hi Mat, thanks for your reply!

I have a GTX 280, and although I get good performance compared with the CPU, I'm still about 10x slower than CUBLAS (I've included my timing information below; times are in seconds).

size   serial     OpenACC    CUBLAS
 256    0.014      0.002      0.000
 512    0.112      0.014      0.006
 768    0.380      0.028      0.006
1024    0.900      0.068      0.012
1280    1.732      0.118      0.022
1536    2.996      0.206      0.034
1792    4.748      0.294      0.048
2048    7.092      0.446      0.068
2304   10.080      0.634      0.090
2560   13.820      0.868      0.122
2816   18.382      1.186      0.158
3072   23.878      1.462      0.196
3328   30.338      2.002      0.244
3584   37.898      2.414      0.296
3840   46.722      3.160      0.358
4096   56.584      3.220      0.428

What I was looking for was to make my ACC implementation more like the CUDA implementation (which uses shared memory for caching).

Any tips on how to improve my performance in this case would be greatly appreciated.

Thanks
mkcolg

Joined: 30 Jun 2004
Posts: 5952
Location: The Portland Group Inc.

Posted: Thu Oct 18, 2012 10:55 am

Hi lechat,

When I duplicate your code, my OpenACC performance is roughly double what you show (0.34 seconds for the 2304 size with your schedule, 0.29 with the schedule shown below). Granted, I'm using a C2070, but I don't think that accounts for all of the difference. What's the output when you compile with "-Minfo=accel"?
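
For reference, I mean a build line along these lines (a sketch; adjust the flags to your setup):

Code:
pgcc -fast -acc -Minfo=accel mmul.c -o mmul

The -Minfo=accel output shows how each loop was scheduled and what data movement the compiler generated.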

Note that you put the "restrict" keyword in the wrong place (it belongs after the "*"), so the compiler may not be optimizing your code as well as it could. Also, you can get better performance by giving the "i" loop a longer vector length.

Here's my version of this routine:
Code:
void mmul(int size, float * restrict A, float * restrict B,
          float * restrict C)
{
  // Compute matrix multiplication.
#pragma acc kernels copyin(A[0:size*size],B[0:size*size]) \
  copyout(C[0:size*size])
  {
#pragma acc loop gang vector(256) independent
    for (int i = 0; i < size; ++i) {
#pragma acc loop gang vector(2) independent
      for (int j = 0; j < size; ++j) {
        float tmp = 0.0;
        for (int k = 0; k < size; ++k) {
          tmp += A[i + k*size] * B[k + j*size];
        }
        C[i + j*size] = tmp;
      }
    }
  }
}

Quote:
I'm still about 10x slower than CUBLAS (I've included my timing information below).

I would expect CUBLAS to be faster since it's been hand-optimized. However, 10x seems a bit high. Let's see if you can get better performance using the version above. Also, are you including data transfer time in your CUBLAS timings? At least for me, my OpenACC time does include data transfer times.
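
As a rough sketch of what I mean (a hypothetical harness, untested; names made up), timing the host/device transfers together with the GEMM call makes the comparison apples-to-apples:

Code:
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical harness: returns elapsed milliseconds including the
// host<->device copies, so it's comparable to the OpenACC numbers.
float time_cublas_mmul(const float* a, const float* b, float* c, int n) {
  size_t bytes = (size_t)n * n * sizeof(float);
  float *da, *db, *dc;
  cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
  cublasHandle_t h;
  cublasCreate(&h);
  const float one = 1.0f, zero = 0.0f;
  cudaEvent_t start, stop;
  cudaEventCreate(&start); cudaEventCreate(&stop);
  cudaEventRecord(start);
  cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);
  // C = A * B, column-major, matching the [i + n*j] indexing above.
  cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
              &one, da, n, db, n, &zero, dc, n);
  cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  cublasDestroy(h);
  cudaFree(da); cudaFree(db); cudaFree(dc);
  cudaEventDestroy(start); cudaEventDestroy(stop);
  return ms;
}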

- Mat