pgaccelinfo error code=999

DavidGutzwiller
Posts: 16
Joined: Jan 30 2015

Post by DavidGutzwiller » Wed Mar 18, 2020 3:23 pm

One of my colleagues can build with PGI 19.4 on his local workstation, but the executable crashes at runtime. I was able to reproduce the same error with pgaccelinfo:

nint0112:~/BUILD21/> /common/pgi/linux86-64/19.4/bin/pgaccelinfo -v
CUDA Driver Version: 10020
NVRM version: NVIDIA UNIX x86_64 Kernel Module 440.64 Fri Feb 21 01:17:26 UTC 2020
could not initialize CUDA runtime, error code=999
No accelerators found.
Check the permissions on your CUDA device

Interestingly, nvidia-smi does not indicate any problems:

nint0112:~/BUILD21> nvidia-smi
Wed Mar 18 23:19:42 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64       Driver Version: 440.64       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro P2000        Off  | 00000000:21:00.0 Off |                  N/A |
| 52%   45C    P0    19W /  75W |    325MiB /  5050MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

The local NVIDIA driver is quite new: version 440.64 with CUDA 10.2. I don't think this should be a problem for a PGI 19.4 executable. Is this correct? I saw some other postings that mentioned file permission issues, but I don't see any problems in this regard:

nint0112:~/BUILD21> ls -lah /dev/nvidia0
crw-rw-rw- 1 root root 195, 0 Mär 18 08:22 /dev/nvidia0
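The `ls` check above only covers `/dev/nvidia0`. As a rough sketch (not something posted in the thread), the same check can be extended to every NVIDIA device node. The function names here are made up for illustration, and the assumption that `/dev/nvidiactl` and `/dev/nvidia-uvm` must also exist for CUDA to initialize reflects my understanding of the Linux driver, not anything confirmed in this thread.

```python
import glob
import os


def check_device_nodes(pattern="/dev/nvidia*"):
    """Return {path: (readable, writable)} for every node matching pattern.

    The access test is done for the current user, so run it as the same
    user that sees the pgaccelinfo failure.
    """
    report = {}
    for path in sorted(glob.glob(pattern)):
        report[path] = (os.access(path, os.R_OK), os.access(path, os.W_OK))
    return report


def missing_nodes(report, expected=("/dev/nvidiactl", "/dev/nvidia-uvm")):
    """List expected nodes absent from the report.

    ASSUMPTION: the 'expected' defaults are the nodes I believe CUDA needs
    in addition to the per-GPU /dev/nvidiaN entries.
    """
    return [path for path in expected if path not in report]
```

Running `check_device_nodes()` and then `missing_nodes(...)` on the affected workstation would show whether any node is absent or not readable and writable for the current user, which a listing of `/dev/nvidia0` alone cannot reveal.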

Are you aware of any other workarounds for this issue?

Thanks,
David

mkcolg
Posts: 8319
Joined: Jun 30 2004

Re: pgaccelinfo error code=999

Post by mkcolg » Thu Mar 19, 2020 9:23 am

Hi David,

I don't think there's an incompatibility between 19.4, which supports up to CUDA 10.1, and a CUDA 10.2 driver. At least I didn't see any issues on a system with a CUDA 10.2 driver, albeit a slightly older version, 440.33. Granted, there have been driver issues in the past, so it could be a specific problem with 440.64. I'll see if my IT folks can install 440.64 on a system for me to test.
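For reference, the "CUDA Driver Version: 10020" line printed by pgaccelinfo uses the CUDA integer encoding 1000*major + 10*minor, so it corresponds to 10.2. A minimal sketch of the comparison being described follows. The helper names are hypothetical, and the rule "driver version at or above the toolkit version is fine" is the usual CUDA backward-compatibility expectation, not a guarantee for any specific driver build.

```python
def decode_cuda_version(value):
    """Split a CUDA integer version (1000*major + 10*minor) into (major, minor)."""
    return value // 1000, (value % 1000) // 10


def driver_covers(driver_version, toolkit):
    """True if the driver's CUDA version is at least the toolkit's (major, minor).

    ASSUMPTION: relies on the general rule that newer drivers run binaries
    built against older CUDA toolkits.
    """
    return decode_cuda_version(driver_version) >= tuple(toolkit)
```

With the values from this thread, `decode_cuda_version(10020)` gives `(10, 2)` and `driver_covers(10020, (10, 1))` is `True`, which matches the expectation that a 10.2 driver should run a 19.4 (CUDA 10.1) executable.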

Though, this looks similar to issues I've seen in the past where libcuda.so isn't installed properly in the system's lib directory (or maybe has the wrong permissions) and pgaccelinfo is picking up the OpenCL driver. Are you able to run a simple CUDA code?
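If compiling a CUDA test program is inconvenient, a rough equivalent is to load the driver library directly and call `cuInit`, which is the step that appears to be failing. The sketch below uses Python's `ctypes`. The library name `libcuda.so.1` is an assumption about the usual Linux install, the function names `probe_cuda_driver` and `describe` are made up for illustration, and only a handful of `CUresult` codes are listed (999 is `CUDA_ERROR_UNKNOWN`).

```python
import ctypes

# Small subset of CUresult values from cuda.h; 999 is the code pgaccelinfo printed.
CU_RESULT_NAMES = {
    0: "CUDA_SUCCESS",
    3: "CUDA_ERROR_NOT_INITIALIZED",
    100: "CUDA_ERROR_NO_DEVICE",
    999: "CUDA_ERROR_UNKNOWN",
}


def describe(code):
    """Map a CUresult integer to its name, falling back to the raw number."""
    return CU_RESULT_NAMES.get(code, "CUresult %d" % code)


def probe_cuda_driver(libname="libcuda.so.1"):
    """Try to load the driver library, initialize it, and count devices.

    Returns a dict naming the stage reached, whether it succeeded, and a detail string.
    ASSUMPTION: libname is where the dynamic loader finds the NVIDIA driver library.
    """
    try:
        lib = ctypes.CDLL(libname)
        cu_init = lib.cuInit
        cu_device_get_count = lib.cuDeviceGetCount
    except (OSError, AttributeError) as exc:
        # Library missing, unreadable, or not actually the CUDA driver library.
        return {"stage": "load", "ok": False, "detail": str(exc)}

    rc = cu_init(0)  # the flags argument must be 0
    if rc != 0:
        return {"stage": "cuInit", "ok": False, "detail": describe(rc)}

    count = ctypes.c_int(0)
    rc = cu_device_get_count(ctypes.byref(count))
    if rc != 0:
        return {"stage": "cuDeviceGetCount", "ok": False, "detail": describe(rc)}
    return {"stage": "done", "ok": True, "detail": "%d device(s)" % count.value}
```

Calling `print(probe_cuda_driver())` on the affected workstation separates the two cases: a failure at the "load" stage points to a missing or unreadable libcuda.so, while a failure at "cuInit" means the library loaded but the driver itself refused to initialize.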

Note that PGI 20.1 does support CUDA 10.2, so you might try updating the compiler version as well if it does turn out to be a CUDA 10.1 vs 10.2 compatibility issue.

-Mat

DavidGutzwiller
Posts: 16
Joined: Jan 30 2015

Re: pgaccelinfo error code=999

Post by DavidGutzwiller » Thu Mar 19, 2020 10:21 am

Hi Mat,

Thanks for the response. I tested a PGI 19.4 build of our solver on a separate node also running CUDA 10.2 and it worked, so indeed there does not seem to be a fundamental compatibility issue.

I'll check on libcuda and see if it has changed recently. Unfortunately I don't have root access on this system, so making changes will be painful. The developer reports that he was able to run his code a few days ago, but it is not clear how his system was configured at the time.

To be continued...

-David
