Dolf
Joined: 22 Mar 2012, Posts: 78
Posted: Mon Dec 17, 2012 1:40 pm, Post subject: RE:
Quote:
-Mcuda=cc13, though you don't really need the cc13.
What does the cc13 switch do? How can I add it to the Visual Studio 2010 project properties when I compile on Windows?
Quote:
Maximum Threads per Block: 512
Maximum Block Dimensions: 512, 512, 64
Maximum Grid Dimensions: 65535 x 65535 x 1
Does that mean I can use a block of (512,512,1) on the Tesla C1060 instead of the (32,16,1) I am currently using on the GeForce?
Quote:
(32,16,1) That will work, though I've found a 16x16 typically works better.
What do you mean by "works better"? Does it mean faster? Speed is the most important factor for my code.
Many thanks,
Dolf
mkcolg
Joined: 30 Jun 2004, Posts: 4996, Location: The Portland Group Inc.
Posted: Mon Dec 17, 2012 5:23 pm, Post subject:
"cc13" sets which compute capability (CC) to target. You can find which device supports which CC here: http://en.wikipedia.org/wiki/CUDA#Supported_GPUs
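As a sketch, the flag is passed on the PGI compiler command line like this; in PGI Visual Fortran the same option can usually be entered in the project's Fortran property pages as an additional compiler option (the exact page name may vary by version, and the file name below is hypothetical):

```shell
# Target compute capability 1.3 (e.g. Tesla C1060) when building CUDA Fortran.
# In Visual Studio 2010 with PGI Visual Fortran, add "-Mcuda=cc13" under
# Project Properties -> Fortran -> Additional Options (or the equivalent page).
pgfortran -Mcuda=cc13 -o mykernel mykernel.cuf
```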
Quote:
does that mean that I can use block of (512,512,1) in Tesla C1060 instead of (32,16,1) I am currently using in GeForce?

No, the product of the three dimensions cannot exceed the maximum threads per block (i.e. 512). So you can have (512,1,1), (1,512,1), (32,16,1), or (8,1,64), etc., but not (512,512,1).
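As a minimal CUDA Fortran sketch of a legal launch configuration under that limit (the kernel, array, and size names are hypothetical):

```fortran
! Hypothetical launch setup: on a cc13 device the product of the block
! dimensions must not exceed 512 threads per block.
use cudafor
type(dim3) :: grid, block

! Valid: 32*16*1 = 512 threads, exactly at the cc13 limit.
block = dim3(32, 16, 1)
! Grid sized to cover an n x m problem, one thread per element.
grid  = dim3(ceiling(real(n)/block%x), ceiling(real(m)/block%y), 1)

! block = dim3(512, 512, 1) would fail: 512*512 = 262144 > 512.
call mykernel<<<grid, block>>>(a_d, n, m)
```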
Quote:
what do you mean works better? does it mean faster? since that is the most important factor to my code.

Using the maximum number of threads may not always produce the fastest code. It's best to try a variety of schedules to see which one is optimal.
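One way to compare schedules is to time each launch with CUDA events; a hedged CUDA Fortran sketch (kernel and variable names are hypothetical) might look like:

```fortran
! Hypothetical timing harness: measure one block shape with CUDA events,
! then repeat with other shapes (e.g. 32x16) and compare the elapsed times.
use cudafor
type(cudaEvent) :: startEv, stopEv
real :: ms
integer :: istat

istat = cudaEventCreate(startEv)
istat = cudaEventCreate(stopEv)

istat = cudaEventRecord(startEv, 0)
call mykernel<<<grid, dim3(16, 16, 1)>>>(a_d, n, m)   ! try a 16x16 block
istat = cudaEventRecord(stopEv, 0)
istat = cudaEventSynchronize(stopEv)
istat = cudaEventElapsedTime(ms, startEv, stopEv)
print *, '16x16 block took ', ms, ' ms'
```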
- Mat
Dolf
Joined: 22 Mar 2012, Posts: 78
Posted: Wed Dec 19, 2012 5:22 pm, Post subject: RE:
Perfect, I will try a couple of schedules and see.
Also, Mat, I wanted to ask whether we can call a kernel from another kernel, and how we can do that.
Basically, I have two nested do loops; inside them I call a subroutine which has another two nested loops. I want to run all four loops on the GPU to increase speed.
Please advise.
Dolf
mkcolg
Joined: 30 Jun 2004, Posts: 4996, Location: The Portland Group Inc.
Posted: Thu Dec 20, 2012 12:12 pm, Post subject:
Quote:
also Mat, I wanted to ask you if we can call a kernel, from another kernel, and how can we perform that.

In CUDA Fortran, kernels can call "device"-attributed routines. For dynamic parallelism, i.e. when one kernel calls another global kernel using the chevron syntax, we are in the process of adding this to the 13.x compilers (though you need a K20 device).
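As a minimal sketch of the first option (all names here are hypothetical), a global kernel calling a device-attributed routine might look like:

```fortran
module kernels
  use cudafor
contains
  ! A "device" routine: callable only from device code, not from the host.
  attributes(device) real function square(x)
    real, value :: x
    square = x * x
  end function square

  ! A "global" kernel that calls the device routine for its element.
  attributes(global) subroutine apply_square(a, n)
    real :: a(*)
    integer, value :: n
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x   ! 1-based in CUDA Fortran
    if (i <= n) a(i) = square(a(i))
  end subroutine apply_square
end module kernels
```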
For OpenACC, you currently must inline routines (either manually or via compiler inlining) into compute regions (i.e. there is no true calling support). However, the proposed OpenACC 2.0 specification adds the "routine" directive to help with this, and we are looking into ways for the compiler to do it automatically. These features won't be available until later this year, though.
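For the four-nested-loop case in the earlier post, a manually inlined OpenACC sketch (loop bounds, arrays, and the loop body are hypothetical stand-ins) might look like:

```fortran
! Hypothetical sketch: the subroutine's two inner loops have been inlined
! by hand into the outer pair, so the whole nest sits in one compute region
! and the compiler can parallelize it without any routine-call support.
!$acc kernels
do j = 1, m
  do i = 1, n
    ! body of the former subroutine, inlined:
    do l = 1, p
      do k = 1, q
        a(i, j) = a(i, j) + b(k, l) * c(i, k)
      end do
    end do
  end do
end do
!$acc end kernels
```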
- Mat
Dolf
Joined: 22 Mar 2012, Posts: 78
Posted: Thu Dec 20, 2012 1:13 pm, Post subject: RE:
Thanks, Mat. I will keep an eye out for it.
Cheers,
Dolf