PGI User Forum

CUDA-x86

cuda-x86 documentation
 
PGI User Forum Forum Index -> Programming and Compiling
Todd Martinez



Joined: 31 Aug 2009
Posts: 3

Posted: Sun Aug 12, 2012 3:17 pm    Post subject: cuda-x86 documentation

Is there any detailed documentation for CUDA-x86? What features of CUDA are not supported? How are threads and thread blocks mapped to CPU cores and threads? Will a CUDA program that also uses POSIX threads compile correctly under CUDA-x86, or will the pthreads calls interfere with the thread mapping done by CUDA-x86?

I am compiling a large (>1 million lines) CUDA code under CUDA-x86 and have found that there apparently is a problem with pthreaded CUDA codes. After removing the pthreading from the CUDA code I was able to compile and run without seg faults, but the results are incorrect. I first need to figure out how to compile and run the code correctly, but after that I will also need to understand how to make it perform well. So I need to understand the CUDA-x86 implementation much better than what is given in the documentation I have found so far (PGI Insider articles, etc.).

Finally, I see that some form of textures is supported (since the simpleTexture example works), but I need to know the limits of this support -- our CUDA code uses textures heavily, and I am wondering if that is part of the problem.
brentl



Joined: 20 Jul 2004
Posts: 108

Posted: Mon Aug 13, 2012 9:43 am    Post subject:

Hi Todd,

The low-level implementation is not well documented at this time. Probably the best way to understand the approach we took is to look at the file pgi_cuda_x86.h, which is provided in, and used from, the release include directory.

When a kernel can be optimized, we basically run an entire CUDA threadblock on a single thread, and run the blocks work-shared within an OMP parallel region, collapsed over the gridDim dimensions.

When a kernel cannot be optimized, we run OMP_NUM_THREADS x86 threads which spawn blockDim tasks, and then extend our OMP runtime to allow the tasks to synchronize, in support of CUDA __syncthreads() calls.

The compiler -Minfo option will inform you which CUDA functions could be
optimized and which could not, and hopefully why not.

The main thrust of the work on CUDA-x86 was completed around the CUDA 3.2 timeframe. We have kept the base functionality working as new CUDA versions have been released, but have not added any of the new CUDA features as of this time. It is on our roadmap to update CUDA-x86
once CUDA 5.0 comes out.

The texture support should be solid. We run a suite of tests that compares our results against CUDA for normalized, wrapped, clamped, and linear-filtered lookups in 1, 2, and 3 dimensions, with 1, 2, and 4 return values. It is possible we are missing some cases, but let us know what you find.

Yes, explicit pthread calls and our OpenMP runtime can conflict and cause failures. This is a known issue. The safest approach, for now, is to port the pthreads portions to OpenMP.

We'd be happy to work with you to get your code ported and make CUDA-x86 a better product.
Todd Martinez



Joined: 31 Aug 2009
Posts: 3

Posted: Mon Aug 13, 2012 1:46 pm    Post subject: How many threads get used in CUDA-x86?

OK, this is helpful; I will get back to you with further questions. I did get everything working last night (for single precision -- not sure why, but I still have trouble with double precision). However, now it tells me:

Warning: Number of emulated threads (1) is less than available cpus (8)

Is there a way to tell it to use more emulated threads? Or is this a consequence of the grid/block/thread structure I used? (I set it to 1 thread/block to make everything work)

Thanks,
Todd
brentl



Joined: 20 Jul 2004
Posts: 108

Posted: Mon Aug 13, 2012 5:19 pm    Post subject:

We print that warning just in case you are debugging using pgdbg. Not every x86 thread will hit a breakpoint, and it is usually a frustrating experience.

It is probably a result of your launch configuration. You can dynamically adjust the number of x86 threads that emulate the CUDA threads by explicitly calling omp_set_num_threads, if you want. By default, when we compile CUDA-x86, it is as if you used -mp=allcores, and we default to running a number of x86 threads equal to the number of cores.
Todd Martinez



Joined: 31 Aug 2009
Posts: 3

Posted: Mon Aug 13, 2012 10:49 pm    Post subject: Performance issues with cuda-x86

What is the expected performance of CUDA-x86? The code in question has also been ported to OpenCL (no changes to the code structure; we just added all the OCL headers, etc.). In that case, performance on an Intel Xeon was 15x slower than what is observed on an NVIDIA GPU. But with CUDA-x86, we are seeing performance almost 1000x slower than on a single GPU (same Fermi-class GPU as above). What range of performance do you see in real-world codes (e.g., more than 50 CUDA kernels)? Is this kind of performance hit believable, or does it suggest there are some simple optimizations?
Thanks,
Todd
All times are GMT - 7 Hours
Page 1 of 2

 