JMa
Joined: 30 Nov 2012 Posts: 14
|
Posted: Thu Jan 24, 2013 9:36 pm Post subject: openACC vs. CUDA |
|
|
Hi Mat and All,
I hate to ask this kind of stupid question, but it just crossed my mind and made me curious:
OpenACC is so much easier to implement than CUDA, so is it possible that in the future all users will drop CUDA and switch to OpenACC?
On the other hand, what are the major drawbacks of OpenACC, at least currently, compared with CUDA?
I would be very thankful if you could also point me to some articles or documents discussing this.
Thanks,
JMa |
|
PaulPa
Joined: 02 Aug 2012 Posts: 35
|
Posted: Fri Jan 25, 2013 11:46 am Post subject: Re: openACC vs. CUDA |
|
|
Hi JMa,
These are fairly high-level questions, so let me try to give you some high-level answers as well.
| JMa wrote: |
I hate to ask this kind of stupid question, but it just crossed my mind and made me curious:
OpenACC is so much easier to implement than CUDA, so is it possible that in the future all users will drop CUDA and switch to OpenACC?
|
This is an interesting question, and there doesn't seem to be just one answer to it. IMHO, OpenACC will persist for at least the next few years (whether as OpenACC or as part of OpenMP 4.0). However, low-level programming models will likely still exist in the near future, because they let programmers highly tune their applications (and that's what HPC is about, right?).
On the other hand, OpenACC eases the way we program coprocessors, so I think the two approaches can complement each other. For example: use OpenACC for the easy stuff and manually fine-tune some compute-intensive kernels with CUDA.
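To make the "OpenACC for the easy stuff" half of this concrete, here is a minimal sketch of a directive-annotated axpy loop in C. It is an illustration only, not code from this thread: the function name `axpy_acc`, its signature, and the data clauses are assumptions. A compiler without OpenACC support simply ignores the pragma and runs the loop serially on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: y = a*x + y offloaded with a single OpenACC directive.
 * The copyin/copy clauses are one possible choice; a real solver would more
 * likely keep x and y resident on the device in an enclosing data region. */
void axpy_acc(int n, double a, const double *restrict x, double *restrict y)
{
    assert(n >= 0 && x != NULL && y != NULL);

    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

In a mixed code base, simple vector updates like this one could stay as directives, while only the kernels that dominate the run time would be candidates for a hand-written CUDA version.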
The only situation I can think of that would make CUDA redundant is if compilers became so powerful that they generated the same high-performance code you could achieve with CUDA (or at least came within a few percent). I'm not a compiler engineer, but I doubt that this is what we'll see in the next few years.
| JMa wrote: |
On the other hand, what are the major drawbacks of OpenACC, at least currently, compared with CUDA?
|
So there are some limitations as of right now:
- Function calls are not yet supported within parallel regions (unless they can be inlined)**
- No nested parallelism is allowed**
- ...
- Another limitation is that you as a programmer cannot use CUDA intrinsic functions (e.g. warp functions) within your accelerator region. But this is the way it is supposed to be for a directive-based approach: it should be easy to use and portable across different architectures (i.e. no intrinsics).
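As a hedged illustration of the first limitation in the list: OpenACC 2.0 added the `routine` directive, which asks the compiler to generate a device version of a function so that it can be called from a compute region without relying on inlining. The sketch below assumes an OpenACC 2.0 compiler; the names `scaled_add` and `axpy_with_call` are made up for the example. Without OpenACC support the pragmas are ignored and the code runs serially.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper. "routine seq" requests a sequential device version,
 * so each loop iteration can call it from inside the parallel region. */
#pragma acc routine seq
double scaled_add(double a, double x, double y)
{
    return a * x + y;
}

/* Hypothetical caller: the loop body contains a real function call. */
void axpy_with_call(int n, double a, const double *restrict x, double *restrict y)
{
    assert(n >= 0 && x != NULL && y != NULL);

    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = scaled_add(a, x[i], y[i]);
    }
}
```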
I hope that this answers some of your questions.
Best,
Paul
** These features will be implemented in OpenACC 2.0 (so there's hope :)) |
|
mkcolg
Joined: 30 Jun 2004 Posts: 4996 Location: The Portland Group Inc.
|
Posted: Mon Jan 28, 2013 12:20 pm Post subject: |
|
|
Great answer Paul!
| Quote: |
The only situation I can think of that would make CUDA redundant is if compilers became so powerful that they generated the same high-performance code you could achieve with CUDA (or at least came within a few percent). I'm not a compiler engineer, but I doubt that this is what we'll see in the next few years.
|
For some codes we're already there, but for others we do have a ways to go (I'm still trying to figure out the performance issue you sent). OpenACC is still very new and has its issues, but I for one am excited about the future. I don't think OpenACC will ever fully replace the explicit programming models (e.g. CUDA, OpenCL), but given its portability and the easier access it gives to programming accelerators, especially for people who are not computer science majors, I think it will definitely be widely adopted.
- Mat |
|
PaulPa
Joined: 02 Aug 2012 Posts: 35
|
Posted: Mon Jan 28, 2013 1:45 pm Post subject: |
|
|
| mkcolg wrote: |
Great answer Paul!
For some codes we're already there, but for others we do have a ways to go (I'm still trying to figure out the performance issue you sent). OpenACC is still very new and has its issues, but I for one am excited about the future. I don't think OpenACC will ever fully replace the explicit programming models (e.g. CUDA, OpenCL), but given its portability and the easier access it gives to programming accelerators, especially for people who are not computer science majors, I think it will definitely be widely adopted.
|
Thanks, I couldn't agree more. OpenACC is a very promising tool and I'd like to see how it evolves in the future.
@Mat: I'm still curious about this performance issue. I'm currently waiting for pgcc 13.1 so that I can check whether the problem still persists. Feel free to spoil it if you've tested it already :)
Best,
Paul |
|
mkcolg
Joined: 30 Jun 2004 Posts: 4996 Location: The Portland Group Inc.
|
Posted: Mon Jan 28, 2013 4:06 pm Post subject: |
|
|
Hi Paul,
| Quote: |
if you've tested it already
|
I hadn't yet, since I was down in Austin attending a conference (I sat next to the new head of your computing center) to work on standardizing OpenACC benchmarks.
I just reran your code with 13.1, and indeed it appears that we solved whatever the problem was. The times I see are approximately equal to the CUDA version's. Overall the solve time is down from 14.6 seconds to around 9.3.
Here is my PGI_ACC_TIME output from 12.10:
| Code: |
ACC_NOTIFY=0 CG_MAX_ITER=10000 OMP_NUM_THREADS=1 nice -19 ./cg_ser ../fidap011.mtx
This Version uses OpenACC
min 13 max 90 avg 65.688335 entries in a row
padded entries:0 (0.000000)
PARSE DONE!
RESIDUAL:3298031056.460293
ENTERING CG
LEAVING CG
First 10 values of the solution vector x = (0:3.457127e-01' 1:6.113136e-01' 2:6.628127e-01' 3:5.463548e-01' 4:3.840186e-01' 5:5.858319e-01' 6:5.232209e-01' 7:5.619878e-01' 8:4.634887e-01' 9:5.931586e-01' )
Max Iterations:10000
Iterations: 10000
Solve time: 14.624010
Accelerator Kernel Timing data
./solver.c
axpy
45: region entered 20000 times
time(us): total=445,465 init=822 region=444,643
kernels=186,463
w/o init: total=444,643 max=51,596 min=17 avg=22
48: kernel launched 20000 times
grid: [130] block: [128]
time(us): total=186,463 max=228 min=7 avg=9
./solver.c
vectorDot
29: region entered 20001 times
time(us): total=1,691,386 init=1,054 region=1,690,332
kernels=317,837
w/o init: total=1,690,332 max=14,750 min=2 avg=84
30: kernel launched 19965 times
grid: [130] block: [128]
time(us): total=317,837 max=186 min=9 avg=15
./solver.c
vectorDot
21: region entered 20001 times
time(us): total=3,305,567 init=794 region=3,304,773
kernels=469,003
w/o init: total=3,304,773 max=91,778 min=149 avg=165
23: kernel launched 20001 times
grid: [130] block: [128]
time(us): total=285,686 max=190 min=12 avg=14
24: kernel launched 20001 times
grid: [1] block: [256]
time(us): total=183,317 max=251 min=7 avg=9
./solver.c
vectorDot
19: region entered 20001 times
time(us): total=5,888,556 init=811 region=5,887,745
kernels=725
w/o init: total=5,887,745 max=106,772 min=271 avg=294
30: kernel launched 36 times
grid: [130] block: [128]
time(us): total=725 max=180 min=13 avg=20
./solver.c
cg
164: region entered 1 time
time(us): total=81,531 init= region=81,530
kernels=19
w/o init: total=81,530 max=81,530 min=81,530 avg=81,530
166: kernel launched 1 times
grid: [130] block: [128]
time(us): total=19 max=19 min=19 avg=19
./solver.c
nrm2
101: region entered 1 time
time(us): total=67,259
kernels=45
104: kernel launched 1 times
grid: [130] block: [128]
time(us): total=21 max=21 min=21 avg=21
105: kernel launched 1 times
grid: [1] block: [256]
time(us): total=24 max=24 min=24 avg=24
./solver.c
xpay
57: region entered 10001 times
time(us): total=270,678 init=359 region=270,319
kernels=91,370
w/o init: total=270,319 max=71,409 min=17 avg=27
60: kernel launched 10001 times
grid: [130] block: [128]
time(us): total=91,370 max=52 min=7 avg=9
./solver.c
matvec
76: region entered 10001 times
time(us): total=7,829,004 init=436 region=7,828,568
kernels=7,431,620
w/o init: total=7,828,568 max=281,892 min=740 avg=782
80: kernel launched 10001 times
grid: [16614] block: [128]
time(us): total=7,431,620 max=911 min=730 avg=743
./solver.c
cg
154: region entered 1 time
time(us): total=14,623,965
data=3,992
acc_init.c
acc_init
38: region entered 1 time
time(us): init=523,977 |
Again with 13.1:
| Code: |
ACC_NOTIFY=0 CG_MAX_ITER=10000 OMP_NUM_THREADS=1 nice -19 ./cg_ser ../fidap011.mtx
This Version uses OpenACC
min 13 max 90 avg 65.688335 entries in a row
padded entries:0 (0.000000)
PARSE DONE!
RESIDUAL:3298031056.460293
ENTERING CG
LEAVING CG
First 10 values of the solution vector x = (0:3.457127e-01' 1:6.113136e-01' 2:6.628127e-01' 3:5.463548e-01' 4:3.840186e-01' 5:5.858319e-01' 6:5.232209e-01' 7:5.619878e-01' 8:4.634887e-01' 9:5.931586e-01' )
Max Iterations:10000
Iterations: 10000
Solve time: 9.302466
Accelerator Kernel Timing data
./solver.c
vectorDot NVIDIA devicenum=0
time(us): 645,419
23: kernel launched 20001 times
grid: [130] block: [128]
device time(us): total=312,471 max=278 min=10 avg=15
elapsed time(us): total=458,380 max=285 min=19 avg=22
23: reduction kernel launched 20001 times
grid: [1] block: [256]
device time(us): total=166,978 max=174 min=6 avg=8
elapsed time(us): total=315,601 max=182 min=14 avg=15
30: kernel launched 20001 times
grid: [130] block: [128]
device time(us): total=165,970 max=172 min=6 avg=8
elapsed time(us): total=316,702 max=509 min=13 avg=15
./solver.c
axpy NVIDIA devicenum=0
time(us): 173,763
48: kernel launched 20000 times
grid: [130] block: [128]
device time(us): total=173,763 max=180 min=6 avg=8
elapsed time(us): total=327,764 max=2,092 min=13 avg=16
./solver.c
xpay NVIDIA devicenum=0
time(us): 81,667
60: kernel launched 10001 times
grid: [130] block: [128]
device time(us): total=81,667 max=33 min=7 avg=8
elapsed time(us): total=159,996 max=634 min=14 avg=15
./solver.c
matvec NVIDIA devicenum=0
time(us): 6,971,615
80: kernel launched 10001 times
grid: [16614] block: [128]
device time(us): total=6,971,615 max=1,007 min=692 avg=697
elapsed time(us): total=7,049,936 max=1,753 min=699 avg=704
./solver.c
nrm2 NVIDIA devicenum=0
time(us): 26
104: kernel launched 1 times
grid: [130] block: [128]
device time(us): total=17 max=17 min=17 avg=17
elapsed time(us): total=25 max=25 min=25 avg=25
104: reduction kernel launched 1 times
grid: [1] block: [256]
device time(us): total=9 max=9 min=9 avg=9
elapsed time(us): total=16 max=16 min=16 avg=16
./solver.c
cg NVIDIA devicenum=0
time(us): 2,935
154: data copyin reached 5 times
device time(us): total=2,865 max=1,870 min=17 avg=573
166: kernel launched 1 times
grid: [130] block: [128]
device time(us): total=16 max=16 min=16 avg=16
elapsed time(us): total=29 max=29 min=29 avg=29
223: data copyout reached 1 times
device time(us): total=54 max=54 min=54 avg=54
|
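A quick back-of-the-envelope check on the two profiles above, using only numbers reported in them: the solve time goes from 14.624010 s (12.10) to 9.302466 s (13.1), and in the 13.1 run the matvec kernel accounts for 6,971,615 us of device time. The helper names below are hypothetical; this is just the arithmetic written out in C.

```c
#include <assert.h>

/* Hypothetical helper: ratio of the old run time to the new run time. */
double speedup(double t_old_s, double t_new_s)
{
    assert(t_new_s > 0.0);
    return t_old_s / t_new_s;
}

/* Hypothetical helper: share of a total time (in seconds) taken by one
 * component whose time is given in microseconds, as in the profile. */
double fraction_of_total(double part_us, double total_s)
{
    assert(total_s > 0.0);
    return part_us / (total_s * 1.0e6);
}
```

Calling `speedup(14.624010, 9.302466)` gives about 1.57, and `fraction_of_total(6971615.0, 9.302466)` gives about 0.75, so in the 13.1 run roughly three quarters of the solve time is spent in the matvec kernel.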