dcwarren
Joined: 18 Jun 2012 Posts: 29
Posted: Tue May 07, 2013 12:10 pm Post subject: Watchdog timer kills CUDA code
Hi,
I'm attempting to get a CUDA Fortran code running on Windows, which means dealing with the watchdog timer, which kills any GPU kernel that runs longer than its time limit.
The card I'm using for computation is not device 0 on the machine, and it is not hooked up to any monitors. As such, I'm a bit surprised that the watchdog timer consistently kills my program after a few seconds.
I've already broken down the GPU part of the code into the smallest reasonable units of work, so I can't make any gains there. What are my options?
 |
mkcolg
Joined: 30 Jun 2004 Posts: 5000 Location: The Portland Group Inc.
 |
dcwarren
Joined: 18 Jun 2012 Posts: 29
Posted: Wed May 08, 2013 7:07 am
Thanks, Mat.
I have a write statement at the top of the program that tells me I'm using GPU #1, not GPU #0, so I know I've got the correct one. And pgaccelinfo tells me that both cards have their execution times limited. Looks like it's registry-editing time for me!
Do you guys have a contact at NVIDIA you could bug about this? It seems like it's definitely not a problem on PGI's end, but something a bit deeper.
 |
mkcolg
Joined: 30 Jun 2004 Posts: 5000 Location: The Portland Group Inc.
Posted: Wed May 08, 2013 10:00 am
Quote:
Do you guys have a contact at NVIDIA you could bug about this? It seems like it's definitely not a problem on PGI's end, but something a bit deeper.

Sure, let me ping Mark Harris, who answered the Stack Overflow question.
- Mat
 |
mkcolg
Joined: 30 Jun 2004 Posts: 5000 Location: The Portland Group Inc.
Posted: Mon May 13, 2013 8:14 am
Here's the response I received back from my contacts at NVIDIA:
---------------------------------------------------------------------------------------
On Windows Vista and later, the watchdog timer applies to all WDDM devices, regardless of whether there is a display attached. For someone hitting the timeouts, they have three choices:
(1) Use a TCC-capable board (e.g., a Tesla) and enable TCC mode with nvidia-smi.
(2) Increase the watchdog timeout in the registry (I prefer this over disabling the timeout completely). A timeout of, say, 30-60 seconds is enough to let most valid cases complete but still reset without rebooting in cases of a true hang.
(3) Change the kernels -- or rather the batches of kernels, which are a little hard to predict under WDDM -- so they always finish inside the default two seconds maximum.
If one of these solutions is implemented and the app still hangs or triggers a TDR (Timeout Detection and Recovery) reset, then it could be a legitimate deadlock condition in the application code, the compiler-generated code, or the NVIDIA driver, in that order of likelihood.
----------------------------------------------------------------------------------
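For option (3), here is a minimal sketch of sizing the work so each launch stays under the limit. It assumes you can measure roughly how many work items per second the kernel gets through; the function name plan_batches, its parameters, and the 50% safety margin are illustrative choices, not anything provided by PGI or NVIDIA. It only plans the host-side split. The caller would launch one kernel per batch and synchronize after each launch (for example with cudaDeviceSynchronize) so that WDDM is less likely to merge several launches into one long submission.

```python
import math


def plan_batches(total_items, items_per_second, budget_seconds=2.0, safety=0.5):
    """Split total_items into (start, count) batches sized to fit a time budget.

    items_per_second is a throughput you measured yourself (an assumption here).
    budget_seconds defaults to the two-second limit mentioned in the thread.
    safety is the fraction of the budget each batch is allowed to use.
    """
    if total_items < 0:
        raise ValueError("total_items must be non-negative")
    if items_per_second <= 0 or budget_seconds <= 0 or not 0 < safety <= 1:
        raise ValueError("throughput, budget and safety must be positive")

    # Largest batch expected to finish within the chosen fraction of the budget.
    max_items = max(1, math.floor(items_per_second * budget_seconds * safety))

    batches = []
    start = 0
    while start < total_items:
        count = min(max_items, total_items - start)
        batches.append((start, count))
        start += count
    return batches


if __name__ == "__main__":
    # Hypothetical numbers: 10,000 items at a measured 3,000 items per second.
    for start, count in plan_batches(10_000, 3_000):
        print(f"launch kernel on items [{start}, {start + count})")
```

If the smallest sensible unit of work already runs longer than the budget, no amount of splitting helps, and option (1) or (2) is the way to go.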
My best guess is that your device is set to use WDDM (Windows Display Driver Model) instead of TCC (Tesla Compute Cluster) mode. Here's some documentation I found on how to switch modes: http://http.developer.nvidia.com/ParallelNsight/2.1/Documentation/UserGuide/HTML/Content/Tesla_Compute_Cluster.htm
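As a rough illustration of the nvidia-smi route, the sketch below just builds the command line for switching one GPU's driver model. The helper name tcc_command is made up for this post; the -i (GPU index) and -dm (driver model, 0 for WDDM and 1 for TCC) flags are what I understand current nvidia-smi builds to accept, so check nvidia-smi -h on your driver version before relying on them. Nothing is executed here.

```python
def tcc_command(gpu_index, enable=True):
    """Build the nvidia-smi argument list that switches one GPU's driver model.

    gpu_index is the index nvidia-smi reports for the board (1 in this thread).
    enable=True requests TCC, enable=False requests WDDM.
    """
    if gpu_index < 0:
        raise ValueError("gpu_index must be non-negative")
    # -i selects the GPU; -dm 1 requests TCC, -dm 0 requests WDDM.
    return ["nvidia-smi", "-i", str(gpu_index), "-dm", "1" if enable else "0"]


if __name__ == "__main__":
    # Print the command for the compute card discussed in this thread.
    print(" ".join(tcc_command(1)))
```

The printed command would typically need to be run from an administrator prompt, and the mode change generally takes effect only after a reboot.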
If you are using a non-Tesla card (such as a GTX or Quadro), then your best option would be to increase the watchdog timeout.
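For the registry route, here is a hedged sketch that writes out the text of a .reg file instead of touching the registry directly, so you can inspect it before importing. The thread itself does not name the key; the GraphicsDrivers path and the TdrDelay value (a DWORD holding the timeout in seconds) are taken from Microsoft's TDR registry documentation as I understand it, and the function name tdr_delay_reg is hypothetical.

```python
# Key under which Windows reads its TDR settings (assumption, per Microsoft's docs).
KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"


def tdr_delay_reg(seconds):
    """Return the text of a .reg file that sets the watchdog timeout in seconds."""
    if not isinstance(seconds, int) or not 1 <= seconds <= 0xFFFFFFFF:
        raise ValueError("seconds must be a positive 32-bit integer")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{KEY}]",
        # .reg files store DWORD values as eight hexadecimal digits.
        f'"TdrDelay"=dword:{seconds:08x}',
        "",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # 60 seconds, the upper end of the range suggested in the NVIDIA reply.
    print(tdr_delay_reg(60))
```

Save the output as a .reg file, review it, and import it as an administrator; a reboot is normally needed before the new timeout applies. Back up the registry first.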
Hope this helps,
Mat
 |