PGI User Forum

"ECC error" and device to host data transfer quest
mkcolg (Joined: 30 Jun 2004, Posts: 5952, Location: The Portland Group Inc.)
Posted: Wed Dec 14, 2011 11:40 am

Hi Nicola,

Can you please give more details on the problem?

When I run your program, the output is approximately 1 when i_dbg is set to 5 and approximately 8 when it is set to 40. Since the i_dbg timing gathers the total kernel time up to that iteration, it makes sense that the time is about 8 times larger when you increase the number of iterations by a factor of 8.

- Mat
nicolaprandi (Joined: 06 Jul 2011, Posts: 27)
Posted: Thu Dec 15, 2011 1:02 am

Actually, the time "t" displayed in the output is the value computed by the program in order to predict the evolution of the solid transport. What I was talking about is the time required to copy the value of "t" from the device to the host.

I've uploaded a modified version of the program which includes timing of the copyout operation (called "copyout_time").

http://www.mediafire.com/?xaymuvavzopymup


I would expect the time required to copy a single double-precision value from the device to the host to be almost the same after 5 cycles or after 40 cycles, since the size of the data being copied is the same.

Also, the times to export the variable "time_dev" are very large compared to the kernel's execution time. Shouldn't they be smaller? If you run the code on your PC, what times do you get for copying the time value from the device to the host?


Thanks again,

Nicola
mkcolg (Joined: 30 Jun 2004, Posts: 5952, Location: The Portland Group Inc.)
Posted: Thu Dec 15, 2011 10:31 am

Hi Nicola,

Kernel launches are asynchronous (i.e., the host code continues to execute after a kernel is launched), and the host does not block until the "time=time_dev" data copy is reached. Hence, you're not timing just the data copy but rather the remaining kernel execution time plus the data transfer.

To fix, add a call to "cudaThreadSynchronize" before you start the data transfer timer.
Code:

if(index==i_dbg) then
        ! block until all outstanding kernels have finished
        ierr = cudaThreadSynchronize()
        call system_clock(start_copyout)

        ! device-to-host copy of the scalar being timed
        time = time_dev

        call system_clock(end_copyout)


Before the change:
Code:

****** DEBUG DT *******************
 t:                             8.700117769630859     
 Time required for copyout:     16.58562088012695     


After the change:
Code:

 ****** DEBUG DT *******************
 t:                             8.700117769630859     
 Time required for copyout:    2.9000000722589903E-005


A better method to determine GPU times is to use CUDA event timers (see the sketch later in this post) or profiling. In this case, if I set the environment variable CUDA_PROFILE to 1, we can see the actual time of the data transfer.

The original code has the profile of:
Code:
method=[ memcpyDtoH ] gputime=[ 1.952 ] cputime=[ 7859871.000 ]

So the actual GPU time is only a few microseconds, but the CPU time is nearly 8 seconds. In other words, the call was blocked waiting for the kernels to finish. Note that the profiling itself periodically blocks the host code, which accounts for the ~9 second difference.

The profile of the modified code shows a nearly identical GPU time, but the CPU time now reflects only the data transfer.
Code:

method=[ memcpyDtoH ] gputime=[ 1.984 ] cputime=[ 16.000 ]

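If you do want to time individual operations from within the code, CUDA event timers can be used directly from CUDA Fortran. Here is a minimal, self-contained sketch, assuming compilation with -Mcuda; the program name, event names, and the dummy "time_dev" scalar are just placeholders, not your actual code:
Code:

program time_copyout
   use cudafor
   implicit none
   type(cudaEvent) :: start_ev, stop_ev
   real :: elapsed_ms
   real(8) :: time
   real(8), device :: time_dev
   integer :: ierr

   time_dev = 1.0d0                       ! stand-in for the value your kernels compute

   ierr = cudaEventCreate(start_ev)
   ierr = cudaEventCreate(stop_ev)

   ierr = cudaEventRecord(start_ev, 0)    ! record on the default stream
   time = time_dev                        ! the device-to-host copy being timed
   ierr = cudaEventRecord(stop_ev, 0)
   ierr = cudaEventSynchronize(stop_ev)   ! wait for the stop event to complete

   ierr = cudaEventElapsedTime(elapsed_ms, start_ev, stop_ev)
   print *, "copyout time (ms):", elapsed_ms

   ierr = cudaEventDestroy(start_ev)
   ierr = cudaEventDestroy(stop_ev)
end program time_copyout

Since the events are recorded in the GPU's stream, cudaEventElapsedTime reports the device-side time of the copy rather than any host-side blocking.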

Hope this explains things,
Mat
nicolaprandi (Joined: 06 Jul 2011, Posts: 27)
Posted: Fri Dec 16, 2011 3:32 am

Hi Mat,

Thanks for your answers. Since the slow-down is not due to the device-to-host transfer, I tried to profile the code.

The PGI Profiler is meant for codes developed with Accelerator clauses, so I downloaded the NVIDIA CUDA Visual Profiler. I tried several times to profile the .exe obtained in "Release mode" (VS2008), with no success. In particular, I did the following steps:

- Compile the code;
- Put the .exe file inside a folder with the input data and cudart64_32_16.dll (if that file is missing, I get an error);
- Start the Profiler;
- Either "Create..." or "Profile Application...";
- Locate the .exe file in the "Launch:" field;
- Remove the "Max Execution Time" value.

What I got is:

- Most of the time, the video driver crashes (I'm on Windows 7 x64 SP1 with the 276 drivers);
- Once or twice I managed to complete the 15 steps (even though the video driver crashed 2 or 3 times), but only the first 4 or 5 kernels got profiled.

Where did I go wrong? I uploaded the compiled file to MediaFire:

http://www.mediafire.com/?6sp678c2ct6ojjm

Instead of the NVIDIA profiler, should I use CUDA events to profile the program?


Thanks (again) in advance,

Nicola
mkcolg (Joined: 30 Jun 2004, Posts: 5952, Location: The Portland Group Inc.)
Posted: Wed Dec 21, 2011 11:40 am

Hi Nicola,

Quote:
The PGI Profiler is meant for codes developed with Accelerator clauses
No, the profiler can be used with CUDA Fortran. How are you collecting the profile information? From a PGI shell, try running the "pgcollect" utility with the "-cuda" flag (see the example after the help output below).

From pgcollect -help:
Code:

Profiling of Accelerator/GPU Events:
-cuda[=gmem|branch|cfg:<cfgpath>|cc10|cc11|cc12|cc13|cc20|list]
                    Collect performance data from CUDA-enabled GPU
    gmem            Global memory access statistics
    branch          Branching and Warp statistics
    cfg:<cfgpath>   Specifies <cfgpath> as CUDA profile config file
    cc10            Use counters for compute capability 1.x
    cc11            Use counters for compute capability 1.x
    cc12            Use counters for compute capability 1.x
    cc13            Use counters for compute capability 1.x
    cc20            Use counters for compute capability 2.0
    list            List cuda event names used in profile config file
-cudainit           Initialize CUDA driver to eliminate overhead from profile

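For example, assuming your executable is named sediment.exe (a placeholder; substitute your actual file name), something along these lines should collect a CUDA profile and then display it with PGPROF. If I remember right, pgcollect writes its data to pgprof.out in the current directory:
Code:

pgcollect -cuda=gmem sediment.exe
pgprof -exe sediment.exe pgprof.out
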

Quote:
so I downloaded the nVidia CUDA Visual Profiler.

Sorry, I've never used NVIDIA's profiler, so I don't really know how to solve this problem. Hopefully someone else can step in.

- Mat