PGI User Forum


Hitting card memory limit crashes code and hangs card

 
PGI User Forum Forum Index -> Accelerator Programming
alfvenwave



Joined: 08 Apr 2010
Posts: 79

Posted: Mon Apr 12, 2010 5:27 am    Post subject: Hitting card memory limit crashes code and hangs card

Hi.

I am having a problem with GPUs being left in a hung state. The code I am running deploys multiple GPUs within an OpenMP parallel loop. The code works fine with a single thread; however, when running on multiple cards I sometimes get a crash due to memory problems. This would be fine, except that one or more cards are often left in a hung state. Rebooting the machine fixes this, but the machine is part of a cluster, so this is not a very satisfactory solution. Is there some way of restarting graphics cards from the command line if they end up in a jammed state?

Rob.
dholt



Joined: 30 Jul 2008
Posts: 15
Location: The Portland Group Inc.

Posted: Mon Apr 12, 2010 10:51 am

Hi Rob,

If you're on Linux, you can try running (as root):

Code:
modprobe -vr nvidia


to unload the driver, followed by:

Code:
modprobe -v nvidia


to start it again.

You will need to kill off any running processes using the GPU beforehand, and in some cases, if something has the GPU tied up, you won't be able to unload the driver. At that point the only solution I know of is to restart the system.
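The unload/reload steps above can be wrapped in a small script. This is a minimal sketch and not something posted in the thread: it assumes a Linux system where the driver module is named nvidia (as in the commands above), that the GPU device nodes appear as /dev/nvidia*, and that fuser (from the psmisc package) is installed. The function names run and reload_nvidia are hypothetical. Because fuser -k sends SIGKILL to every process holding a device node open, the script only prints the commands it would run unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: reload the NVIDIA kernel module after a GPU hang (run as root).
# Assumptions: Linux, module named "nvidia", device nodes /dev/nvidia*,
# fuser (psmisc) installed. Commands are only printed unless DRY_RUN=0.

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

reload_nvidia() {
    # Kill any process still holding a GPU device node open; otherwise
    # the module cannot be unloaded. The -e test skips the literal
    # pattern left behind when no /dev/nvidia* nodes exist.
    for dev in /dev/nvidia*; do
        [ -e "$dev" ] || continue
        run fuser -k "$dev"
    done

    # Unload the driver; give up if something still has it tied up.
    if ! run modprobe -vr nvidia; then
        echo "could not unload nvidia module; a reboot may be required" >&2
        return 1
    fi

    # Load the driver again.
    run modprobe -v nvidia
}

reload_nvidia
```

To actually reset the driver, run it as root with DRY_RUN=0 (for example `DRY_RUN=0 sh reload_nvidia.sh`, where the file name is arbitrary). On newer driver releases, modules that depend on nvidia (for example nvidia_uvm) may have to be removed before modprobe -r nvidia succeeds; the sketch does not handle that.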
alfvenwave



Joined: 08 Apr 2010
Posts: 79

Posted: Mon Apr 12, 2010 11:00 am

Great - thanks for this. I have managed to persuade our system administrator to give me root access for now. Bit annoying that they hang so often - is this common?

Rob.
mkcolg



Joined: 30 Jun 2004
Posts: 6218
Location: The Portland Group Inc.

Posted: Mon Apr 12, 2010 11:11 am

While I don't know if it's common for others, my device hangs periodically as well. Like you, it occurs more often when I'm running code that hits a device limit or encounters some other error.

- Mat
alfvenwave



Joined: 08 Apr 2010
Posts: 79

Posted: Tue Apr 13, 2010 3:45 am

Hi Mat. A bit annoying, but now that I have root access I can reboot as necessary. My next thread shows why I keep hitting problems...

Thanks,

Rob.
All times are GMT - 7 Hours