PGI User Forum

Watchdog timer kills CUDA code
dcwarren



Joined: 18 Jun 2012
Posts: 29

Posted: Tue May 07, 2013 12:10 pm    Post subject: Watchdog timer kills CUDA code

Hi,

I'm attempting to get a CUDA Fortran code running on Windows, which means dealing with the watchdog timer that kills any GPU kernel that runs longer than a set time limit.

The card I'm using for computation is not device 0 on the machine, and it is not hooked up to any monitors. As such, I'm a bit surprised that the watchdog timer consistently kills my program after a few seconds.

I've already broken down the GPU part of the code into the smallest reasonable units of work, so I can't make any gains there. What are my options?
mkcolg



Joined: 30 Jun 2004
Posts: 6142
Location: The Portland Group Inc.

Posted: Tue May 07, 2013 2:00 pm

Quote:
The card I'm using for computation is not device 0 on the machine, and it is not hooked up to any monitors. As such, I'm a bit surprised that the watchdog timer consistently kills my program after a few seconds.
I'm surprised as well, since the watchdog timer should only kill processes on devices with an attached monitor. Try running with the environment variable "PGI_ACC_TIME=1" set and double-check that the program isn't accidentally using device 0. You can also set the device number using the environment variable "ACC_DEVICE_NUM=1".
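For anyone who launches the program from a script, here is a minimal Python sketch of setting the two environment variables mentioned above. The helper name `build_env` and the executable name `my_cuda_app.exe` are hypothetical and only serve as an illustration.

```python
import os


def build_env(device_num, base_env=None):
    """Return a copy of the environment with the two PGI variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["ACC_DEVICE_NUM"] = str(device_num)  # device number, as suggested in the post
    env["PGI_ACC_TIME"] = "1"                # ask the PGI runtime for its timing report
    return env


env = build_env(1)
# Hypothetical launch; replace the name with your own executable:
# import subprocess
# subprocess.run(["my_cuda_app.exe"], env=env, check=True)
```

The function copies the environment instead of modifying it in place, so the settings apply only to the launched program.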

Other than that, you need to start hacking the registry to disable the watchdog timer.

http://stackoverflow.com/questions/10272513/cuda-nvidia-driver-crash-while-running
http://msdn.microsoft.com/en-us/windows/hardware/gg487368.aspx

Note that I saw this post from someone with a similar issue, but no one from NVIDIA has answered it yet.
https://forums.geforce.com/default/topic/531745/two-gpu-39-s-still-getting-windows-watchdog-timer/

- Mat
dcwarren



Joined: 18 Jun 2012
Posts: 29

Posted: Wed May 08, 2013 7:07 am

Thanks, Mat.

I have a write statement at the top of the program that tells me I'm using GPU #1, not GPU #0, so I know I've got the correct one. And pgaccelinfo tells me that both cards have their execution times limited. Looks like it's registry-editing time for me!

Do you guys have a contact at NVIDIA you could bug about this? It seems like it's definitely not a problem on PGI's end, but something a bit deeper.
mkcolg



Joined: 30 Jun 2004
Posts: 6142
Location: The Portland Group Inc.

Posted: Wed May 08, 2013 10:00 am

Quote:
Do you guys have a contact at NVIDIA you could bug about this? It seems like it's definitely not a problem on PGI's end, but something a bit deeper.
Sure, let me ping Mark Harris, who answered the Stack Overflow question.

- Mat
mkcolg



Joined: 30 Jun 2004
Posts: 6142
Location: The Portland Group Inc.

Posted: Mon May 13, 2013 8:14 am

Here's the response I received back from my contacts at NVIDIA:

---------------------------------------------------------------------------------------
On Windows Vista and later, the watchdog timer applies to all WDDM devices, regardless of whether there is a display attached. For someone hitting the timeouts, they have three choices:

(1) Use a TCC-capable board (e.g., a Tesla) and enable TCC mode with nvidia-smi.
(2) Increase the watchdog timeout in the registry (I prefer this over disabling the timeout completely). A timeout of, say, 30-60 seconds is enough to let most valid cases complete but still reset without rebooting in cases of a true hang.
(3) Change the kernels -- or rather the batches of kernels, which are a little hard to predict under WDDM -- so they always finish inside the default two seconds maximum.

If one of these solutions is implemented and the app still hangs/TDR's, then it could be a legitimate deadlock condition in the application code, the compiler-generated code, or the NVIDIA driver, in that order of likelihood.
----------------------------------------------------------------------------------
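To make option (3) concrete, here is a rough Python sketch of the arithmetic for sizing work chunks so that each launch stays under the two-second default. The function name `plan_chunks`, the per-item timing estimate, and the 50% safety margin are all assumptions for illustration. The sketch also assumes the host waits for each chunk to finish before launching the next, so that WDDM does not batch several launches into one timed unit.

```python
import math


def plan_chunks(total_items, seconds_per_item, limit_seconds=2.0, safety=0.5):
    """Split total_items into (start, count) chunks that fit the time budget.

    seconds_per_item is a measured or estimated cost per work item
    (an assumption; it has to come from your own timing runs).
    """
    if total_items < 0 or seconds_per_item <= 0:
        raise ValueError("need total_items >= 0 and seconds_per_item > 0")
    budget = limit_seconds * safety  # leave headroom below the watchdog limit
    per_chunk = max(1, math.floor(budget / seconds_per_item))
    return [
        (start, min(per_chunk, total_items - start))
        for start in range(0, total_items, per_chunk)
    ]


# 10 items at roughly 0.3 s each, with a 1.0 s budget per launch
print(plan_chunks(10, 0.3))  # [(0, 3), (3, 3), (6, 3), (9, 1)]
```

If a single item already exceeds the budget, the sketch still returns chunks of one item each. In that case only options (1) or (2) would help.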

My best guess is that your device is set to use WDDM (Windows Display Driver Model) instead of TCC (Tesla Compute Cluster) mode. Here's some documentation I found on how to switch modes: http://http.developer.nvidia.com/ParallelNsight/2.1/Documentation/UserGuide/HTML/Content/Tesla_Compute_Cluster.htm.
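As a hedged illustration, the switch is usually done with the `-dm` (driver model) option of nvidia-smi, where 1 selects TCC and 0 selects WDDM. The Python sketch below only builds the command line. The helper name `tcc_command` is hypothetical, and the command typically has to be run from an administrator prompt and followed by a reboot.

```python
def tcc_command(gpu_index, enable=True):
    """Build the nvidia-smi argument list for switching the driver model.

    -i selects the GPU by index; -dm 1 requests TCC, -dm 0 requests WDDM.
    """
    return ["nvidia-smi", "-i", str(gpu_index), "-dm", "1" if enable else "0"]


print(" ".join(tcc_command(1)))  # nvidia-smi -i 1 -dm 1
# To actually apply it (administrator rights assumed):
# import subprocess
# subprocess.run(tcc_command(1), check=True)
```

This only works on boards that support TCC, which matches option (1) in the reply above.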

If you are using a non-Tesla card (such as a GTX or Quadro), then your best option would be to increase the watchdog timeout.
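For reference, the timeout is controlled by the `TdrDelay` value (a DWORD, in seconds) under the `GraphicsDrivers` registry key described in the Microsoft link earlier in the thread. The Python sketch below generates the text of a .reg file that sets it. The helper name `tdr_delay_reg` is hypothetical. Back up the registry first, and expect a restart to be needed before the new value takes effect.

```python
TDR_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"


def tdr_delay_reg(seconds):
    """Return .reg file text that sets TdrDelay to the given number of seconds."""
    if not 1 <= seconds <= 0xFFFFFFFF:
        raise ValueError("seconds must fit in a positive 32-bit DWORD")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        "[" + TDR_KEY + "]",
        '"TdrDelay"=dword:%08x' % seconds,  # .reg files store DWORDs as 8 hex digits
        "",
    ]
    return "\n".join(lines)


print(tdr_delay_reg(30))
```

A value of 30 falls in the 30-60 second range suggested in the NVIDIA reply, which still lets Windows recover from a true hang without a reboot.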

Hope this helps,
Mat
All times are GMT - 7 Hours
Page 1 of 2

 