PGI User Forum
openACC vs. CUDA
JMa



Joined: 30 Nov 2012
Posts: 22

Posted: Thu Jan 24, 2013 9:36 pm    Post subject: openACC vs. CUDA

Hi Mat and All,

I hate to ask this kind of stupid question, but it just crossed my mind and made me curious:
OpenACC is so much easier to implement than CUDA, so is it possible that in the future users will all drop CUDA and switch to OpenACC?
On the other hand, what are the major drawbacks of OpenACC, at least currently, compared with CUDA?
I would be very thankful if you could also point me to some articles or documents discussing this.

Thanks,

JMa
PaulPa



Joined: 02 Aug 2012
Posts: 35

Posted: Fri Jan 25, 2013 11:46 am    Post subject: Re: openACC vs. CUDA

Hi JMa,

These are fairly high-level questions, so let me try to give you some high-level answers as well.

JMa wrote:

I hate to ask this kind of stupid question, but it just crossed my mind and made me curious:
OpenACC is so much easier to implement than CUDA, so is it possible that in the future users will all drop CUDA and switch to OpenACC?


This is an interesting question, and there doesn't seem to be just one answer to it. IMHO, OpenACC will persist for at least the next few years (whether as OpenACC or as part of OpenMP 4.0). However, low-level programming models will likely still exist in the near future, because they let programmers highly tune their applications (and that's what HPC is about, right?).
On the other hand, OpenACC eases the way we program coprocessors. So I think that both approaches could benefit from each other, e.g. use OpenACC for the easy stuff and manually fine-tune some compute-intensive kernels with CUDA.
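
To make "the easy stuff" concrete, here is a minimal sketch of a vector update offloaded with one OpenACC directive. It is an illustration only, not code from this thread: the function name axpy_acc and its signature are assumptions. A hand-written CUDA version of the same loop would additionally need a kernel function, a launch configuration, and explicit device allocation and copies.

```c
#include <assert.h>

/* y = a*x + y, offloaded with a single OpenACC directive.
 * axpy_acc is a hypothetical name used for illustration.
 * When built without OpenACC support the pragma is ignored and the
 * loop runs serially on the host, producing the same result. */
void axpy_acc(int n, double a, const double *restrict x, double *restrict y)
{
    assert(n >= 0);  /* host-side sanity check */

    /* copyin: x is only read on the device; copy: y is read and written. */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Calling axpy_acc(3, 2.0, x, y) with x = {1, 2, 3} and y = {10, 20, 30} leaves y = {12, 24, 36}, whether or not the loop was offloaded.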

The only situation I can think of that would make CUDA redundant is if compilers became so powerful that they generated the same high-performance code you could achieve with CUDA (or at least within a few percent). I'm not a compiler engineer, but I doubt that this is what we'll see in the next few years.

JMa wrote:

On the other hand, what are the major drawbacks of OpenACC, at least currently, compared with CUDA?


So there are some limitations as of right now:
- Function calls are not yet supported within parallel regions (unless they can be inlined)**
- No nested parallelism is allowed**
- ...
- Another limitation is that you as a programmer cannot use CUDA intrinsic functions (e.g. warp functions) within your accelerator region. But this is the way it is supposed to be for a directive-based approach: it should be easy to use and portable across different architectures (i.e. no intrinsics).
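
As a sketch of the first limitation (function calls only work if they can be inlined), the helper below is defined in the same file as the loop that calls it, so the compiler has the chance to inline it into the accelerator region. The names scale_and_shift and transform are hypothetical, and whether a given compiler actually inlines the call depends on its version and flags.

```c
#include <assert.h>

/* Small helper kept in the same translation unit so that it can be
 * inlined. Under OpenACC 1.0 a call that cannot be inlined may stop
 * the compiler from offloading the surrounding loop. */
static inline double scale_and_shift(double v, double scale, double shift)
{
    return scale * v + shift;
}

/* Applies the helper to every element of data inside a parallel loop.
 * Without OpenACC support the pragma is ignored and the loop runs on
 * the host. */
void transform(int n, double scale, double shift, double *restrict data)
{
    assert(n >= 0);  /* host-side sanity check */

    #pragma acc parallel loop copy(data[0:n])
    for (int i = 0; i < n; ++i) {
        data[i] = scale_and_shift(data[i], scale, shift);
    }
}
```

OpenACC 2.0 adds a routine directive for functions called from accelerator regions, which should remove the need to rely on inlining.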

I hope that this answers some of your questions.

Best,
Paul

** These features will be implemented in OpenACC 2.0 (so there's hope :))
mkcolg



Joined: 30 Jun 2004
Posts: 6146
Location: The Portland Group Inc.

Posted: Mon Jan 28, 2013 12:20 pm

Great answer Paul!

Quote:
The only situation I can think of that would make CUDA redundant is if compilers became so powerful that they generated the same high-performance code you could achieve with CUDA (or at least within a few percent). I'm not a compiler engineer, but I doubt that this is what we'll see in the next few years.
For some codes we're already there, but for others we do have a ways to go (I'm still trying to figure out the performance issue you sent). OpenACC is still very new and has its issues, but I for one am excited about the future. I don't think OpenACC will ever fully replace the explicit programming models (i.e. CUDA, OpenCL), but given its portability and easier access to programming accelerators, especially for non-computer-science majors, I think it will definitely be widely adopted.

- Mat
PaulPa



Joined: 02 Aug 2012
Posts: 35

Posted: Mon Jan 28, 2013 1:45 pm

mkcolg wrote:
Great answer Paul!

For some codes we're already there, but for others we do have a ways to go (I'm still trying to figure out the performance issue you sent). OpenACC is still very new and has its issues, but I for one am excited about the future. I don't think OpenACC will ever fully replace the explicit programming models (i.e. CUDA, OpenCL), but given its portability and easier access to programming accelerators, especially for non-computer-science majors, I think it will definitely be widely adopted.


Thanks, I couldn't agree more. OpenACC is a very promising tool and I'd like to see how it evolves in the future.

@Mat: I'm still curious about this performance issue. I'm currently waiting for pgcc 13.1 so that I can check whether the problem still persists. Feel free to spoil it if you've tested it already :)

Best,
Paul
mkcolg



Joined: 30 Jun 2004
Posts: 6146
Location: The Portland Group Inc.

Posted: Mon Jan 28, 2013 4:06 pm

Hi Paul,

Quote:
if you've tested it already
I hadn't yet, since I was down in Austin attending a conference to work on standardizing OpenACC benchmarks (I sat next to the new head of your computing center).

I just reran your code with 13.1, and indeed it appears that we solved whatever the problem was. The times I see are approximately equal to the CUDA version. Overall, the solve time is down from 14.6 seconds to around 9.3.

Here is my PGI_ACC_TIME output from 12.10:
Code:
ACC_NOTIFY=0 CG_MAX_ITER=10000 OMP_NUM_THREADS=1 nice -19 ./cg_ser ../fidap011.mtx
This Version uses OpenACC
min 13 max 90 avg 65.688335 entries in a row
padded entries:0 (0.000000)
PARSE DONE!
RESIDUAL:3298031056.460293
ENTERING CG
LEAVING CG
First 10 values of the solution vector x = (0:3.457127e-01' 1:6.113136e-01' 2:6.628127e-01' 3:5.463548e-01' 4:3.840186e-01' 5:5.858319e-01' 6:5.232209e-01' 7:5.619878e-01' 8:4.634887e-01' 9:5.931586e-01' )
Max Iterations:10000
Iterations: 10000
Solve time: 14.624010

Accelerator Kernel Timing data
./solver.c
  axpy
    45: region entered 20000 times
        time(us): total=445,465 init=822 region=444,643
                  kernels=186,463
        w/o init: total=444,643 max=51,596 min=17 avg=22
        48: kernel launched 20000 times
            grid: [130]  block: [128]
            time(us): total=186,463 max=228 min=7 avg=9
./solver.c
  vectorDot
    29: region entered 20001 times
        time(us): total=1,691,386 init=1,054 region=1,690,332
                  kernels=317,837
        w/o init: total=1,690,332 max=14,750 min=2 avg=84
        30: kernel launched 19965 times
            grid: [130]  block: [128]
            time(us): total=317,837 max=186 min=9 avg=15
./solver.c
  vectorDot
    21: region entered 20001 times
        time(us): total=3,305,567 init=794 region=3,304,773
                  kernels=469,003
        w/o init: total=3,304,773 max=91,778 min=149 avg=165
        23: kernel launched 20001 times
            grid: [130]  block: [128]
            time(us): total=285,686 max=190 min=12 avg=14
        24: kernel launched 20001 times
            grid: [1]  block: [256]
            time(us): total=183,317 max=251 min=7 avg=9
./solver.c
  vectorDot
    19: region entered 20001 times
        time(us): total=5,888,556 init=811 region=5,887,745
                  kernels=725
        w/o init: total=5,887,745 max=106,772 min=271 avg=294
        30: kernel launched 36 times
            grid: [130]  block: [128]
            time(us): total=725 max=180 min=13 avg=20
./solver.c
  cg
    164: region entered 1 time
        time(us): total=81,531 init= region=81,530
                  kernels=19
        w/o init: total=81,530 max=81,530 min=81,530 avg=81,530
        166: kernel launched 1 times
            grid: [130]  block: [128]
            time(us): total=19 max=19 min=19 avg=19
./solver.c
  nrm2
    101: region entered 1 time
        time(us): total=67,259
                  kernels=45
        104: kernel launched 1 times
            grid: [130]  block: [128]
            time(us): total=21 max=21 min=21 avg=21
        105: kernel launched 1 times
            grid: [1]  block: [256]
            time(us): total=24 max=24 min=24 avg=24
./solver.c
  xpay
    57: region entered 10001 times
        time(us): total=270,678 init=359 region=270,319
                  kernels=91,370
        w/o init: total=270,319 max=71,409 min=17 avg=27
        60: kernel launched 10001 times
            grid: [130]  block: [128]
            time(us): total=91,370 max=52 min=7 avg=9
./solver.c
  matvec
    76: region entered 10001 times
        time(us): total=7,829,004 init=436 region=7,828,568
                  kernels=7,431,620
        w/o init: total=7,828,568 max=281,892 min=740 avg=782
        80: kernel launched 10001 times
            grid: [16614]  block: [128]
            time(us): total=7,431,620 max=911 min=730 avg=743
./solver.c
  cg
    154: region entered 1 time
        time(us): total=14,623,965
                  data=3,992
acc_init.c
  acc_init
    38: region entered 1 time
        time(us): init=523,977


Again with 13.1:
Code:

ACC_NOTIFY=0 CG_MAX_ITER=10000 OMP_NUM_THREADS=1 nice -19 ./cg_ser ../fidap011.mtx
This Version uses OpenACC
min 13 max 90 avg 65.688335 entries in a row
padded entries:0 (0.000000)
PARSE DONE!
RESIDUAL:3298031056.460293
ENTERING CG
LEAVING CG
First 10 values of the solution vector x = (0:3.457127e-01' 1:6.113136e-01' 2:6.628127e-01' 3:5.463548e-01' 4:3.840186e-01' 5:5.858319e-01' 6:5.232209e-01' 7:5.619878e-01' 8:4.634887e-01' 9:5.931586e-01' )
Max Iterations:10000
Iterations: 10000
Solve time: 9.302466

Accelerator Kernel Timing data
./solver.c
  vectorDot  NVIDIA  devicenum=0
        time(us): 645,419
        23: kernel launched 20001 times
            grid: [130]  block: [128]
             device time(us): total=312,471 max=278 min=10 avg=15
            elapsed time(us): total=458,380 max=285 min=19 avg=22
        23: reduction kernel launched 20001 times
            grid: [1]  block: [256]
             device time(us): total=166,978 max=174 min=6 avg=8
            elapsed time(us): total=315,601 max=182 min=14 avg=15
        30: kernel launched 20001 times
            grid: [130]  block: [128]
             device time(us): total=165,970 max=172 min=6 avg=8
            elapsed time(us): total=316,702 max=509 min=13 avg=15
./solver.c
  axpy  NVIDIA  devicenum=0
        time(us): 173,763
        48: kernel launched 20000 times
            grid: [130]  block: [128]
             device time(us): total=173,763 max=180 min=6 avg=8
            elapsed time(us): total=327,764 max=2,092 min=13 avg=16
./solver.c
  xpay  NVIDIA  devicenum=0
        time(us): 81,667
        60: kernel launched 10001 times
            grid: [130]  block: [128]
             device time(us): total=81,667 max=33 min=7 avg=8
            elapsed time(us): total=159,996 max=634 min=14 avg=15
./solver.c
  matvec  NVIDIA  devicenum=0
        time(us): 6,971,615
        80: kernel launched 10001 times
            grid: [16614]  block: [128]
             device time(us): total=6,971,615 max=1,007 min=692 avg=697
            elapsed time(us): total=7,049,936 max=1,753 min=699 avg=704
./solver.c
  nrm2  NVIDIA  devicenum=0
        time(us): 26
        104: kernel launched 1 times
            grid: [130]  block: [128]
             device time(us): total=17 max=17 min=17 avg=17
            elapsed time(us): total=25 max=25 min=25 avg=25
        104: reduction kernel launched 1 times
            grid: [1]  block: [256]
             device time(us): total=9 max=9 min=9 avg=9
            elapsed time(us): total=16 max=16 min=16 avg=16
./solver.c
  cg  NVIDIA  devicenum=0
        time(us): 2,935
        154: data copyin reached 5 times
             device time(us): total=2,865 max=1,870 min=17 avg=573
        166: kernel launched 1 times
            grid: [130]  block: [128]
             device time(us): total=16 max=16 min=16 avg=16
            elapsed time(us): total=29 max=29 min=29 avg=29
        223: data copyout reached 1 times
             device time(us): total=54 max=54 min=54 avg=54
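
For readers without the source: in the 13.1 profile above, vectorDot launches a main kernel (grid [130], block [128]) followed by a reduction kernel (grid [1], block [256]). A loop with a reduction clause, roughly like the sketch below, is the kind of code that would plausibly produce such a pair. This is not the actual solver.c; the name vector_dot and its signature are assumptions made for illustration.

```c
#include <assert.h>

/* Dot product of x and y using an OpenACC sum reduction.
 * vector_dot is a hypothetical stand-in, not the thread's vectorDot.
 * Without OpenACC support the pragma is ignored and the loop runs
 * serially on the host. */
double vector_dot(int n, const double *restrict x, const double *restrict y)
{
    assert(n >= 0);  /* host-side sanity check */

    double sum = 0.0;

    /* Each iteration contributes one product; the reduction clause asks
     * the compiler to combine the partial sums into a single value. */
    #pragma acc parallel loop reduction(+:sum) copyin(x[0:n], y[0:n])
    for (int i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

With x = {1, 2, 3} and y = {4, 5, 6}, vector_dot returns 32. In an iterative solver the copyin clauses would normally be replaced by an enclosing data region, so the vectors are not transferred on every call; the 13.1 profile, where data copyin is reported only under cg, is consistent with that.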
All times are GMT - 7 Hours