PGI User Forum
 SearchSearch   MemberlistMemberlist     RegisterRegister   ProfileProfile    Log inLog in 

Free OpenACC Webinar

PGI Accelerator async clause results in error
Goto page 1, 2  Next
 
Post new topic   Reply to topic    PGI User Forum Forum Index -> Accelerator Programming
View previous topic :: View next topic  
Author Message
xray



Joined: 21 Jan 2010
Posts: 85

PostPosted: Tue May 29, 2012 6:58 am    Post subject: PGI Accelerator async clause results in error Reply with quote

Hi,
I use PGI Compiler 12.3 and try to use the async clause with PGI Accelerator directives (parallel regions and update) and different async handles (values 1 or 2). In some cases, my programs runs without any error. However, in a couple of times I get:
Code:
call to cuModuleGetGlobal returned error 500: Not found                                                                                                                               
CUDA driver version: 4020                                                                                                                                                             
Segmentation fault (core dumped)

Any ideas?
Thanks, Sandra
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 6213
Location: The Portland Group Inc.

PostPosted: Tue May 29, 2012 4:37 pm    Post subject: Reply with quote

Hi Sandra,

I have not seen this error before but my best guess is that you have a "reflected" clause in your code. For some reason when it tries to look up a variable, it doesn't exist yet. Maybe adding a "wait" directive would help.

What you can do is set the environment variable "NVDEBUG=1" to have the runtime print out every device call made. You can then seen when cuModuleGetGlobal is called and with which variables.

Granted, async is new so it could be a PGI issue as well. If you want, please send an example to PGI Customer Service and we can try an track down the error.

- Mat
Back to top
View user's profile
xray



Joined: 21 Jan 2010
Posts: 85

PostPosted: Wed May 30, 2012 2:25 am    Post subject: Reply with quote

Hi Mat,
no, I don't have a reflected clause in my code, but all my arrays are "restrict". The structure of my code is:
Code:
int queue1 = 1
int queue2 = 2
#pragma acc data region vars
{
// more stuff
#pragma acc update device x

#pragma acc for region async(queue2)
#pragma acc for parallel
// loop with REDUCTION to "funval2"

#pragma acc for region copyin(bhat) async(queue1)
#pragma acc for parallel
// loop with REDUCTION to "funval1"

#pragma acc wait(queue1)
#pragma acc wait(queue2)
funval = funval1 + funval2

//more stuff
}

I am not sure whether I can really reduce it to send it to you. By the way, I call the code above several times in a row and I also have a loop within the data region that surrounds both kernel calls and the wait directive.

I tried to set NVDEBUG=1. But the output is overwhelming. If I get the seg fault, I will get output like the following:
Code:
[..]
                    __pgi_cu_init( file=<path>/pgiacc_eval_f.c, function=funval, line=93, startline=37, endline=177 )
__pgi_cu_module3( lineno=93 )
__pgi_cu_module3 module already loaded at 0x800b80
 0x0007d000 0x00000000 0x00200000 0x00000002 0x0007d000 0x00000000 0x00000000 0x40590000__pgi_cu_module_function( name=0x440d41=funval_96_gpu, lineno=96, argname=0x0=, argsize=72, SWcachesize=4459876 )

                     0xcccccccd 0x3ff4ccccFunction handle is 0x7f5f80
 0x1fa00000 0x00000002
__pgi_cu_module_function( name=0x440d50=funval_155_gpu_red, lineno=96, argname=0x0=, argsize=0, SWcachesize=0 )
dequeue operation 14:launch queue:2 position:109 dataoffset:192 datasize:208
Function handle is 0x817a60
__pgi_cu_launch_a(func=0x7fcd40, grid=1x1x1, block=256x1x1, lineno=84)
__pgi_cu_launch_a(func=0x7fcd40, params=0x8091f0, bytes=52, sharedbytes=0)
enqueuing operation 12:alloc flags:0x2 async:1 finfo[0x7fffa864e510:32] vinfo[(nil):0] frec:(nil)
enqueue operation 12:alloc queue[1] position[58] data[336]
First arguments are:
                    CPU waiting on ACC async queue:1
     512000          0    2097152          2     512000          0          0 1079574528
                     -858993459 1073007820  530579456          2       2000

                     0x0007d000 0x00000000 0x00200000 0x00000002 0x0007d000 0x00000000 0x00000000 0x40590000
                     0xcccccccd 0x3ff4cccc 0x1fa00000 0x00000002 0x000007d0
dequeue operation 12:alloc queue:1 position:58 dataoffset:336 datasize:32
__pgi_cu_alloc(size=1024,lineno=93,name=bhat)
__pgi_cu_alloc(1024) returns 0x21fb00000
dequeue operation 11:downloadC queue:2 position:110 dataoffset:400 datasize:48
__pgi_cu_downloadc( "b1", size=8, offset=0, lineno=80 )
call to cuModuleGetGlobal returned error 500: Not found
CUDA driver version: 4020
CPU done waiting on ACC async queue:1
enqueuing operation 6:upload flags:0x2 async:1 finfo[0x7fffa864e4e0:48] vinfo[0x7fffa864e600:48] frec:(nil)
enqueue operation 6:upload queue[1] position[59] data[368]
enqueuing operation 3:datadone flags:0x2 async:1 finfo[(nil):0] vinfo[(nil):0] frec:(nil)
enqueue operation 3:datadone queue[1] position[60] data[464]
__pgi_cu_alloc(size=8,lineno=96,name=)
__pgi_cu_alloc(8) returns 0x21fb00400
enqueuing operation 8:uploadC flags:0x2 async:1 finfo[0x7fffa864e4f0:40] vinfo[(nil):0] frec:(nil)
enqueue operation 8:uploadC queue[1] position[61] data[464]
enqueuing operation 14:launch flags:0x2 async:1 finfo[0x7fffa864e450:136] vinfo[0x7fffa864e660:72] frec:(nil)
enqueue operation 14:launch queue[1] position[62] data[512]
enqueuing operation 14:launch flags:0x2 async:1 finfo[0x7fffa864e450:136] vinfo[0x7fffa864e660:76] frec:(nil)
enqueue operation 14:launch queue[1] position[63] data[736]
enqueuing operation 11:downloadC flags:0x2 async:1 finfo[0x7fffa864e4f0:40] vinfo[(nil):0] frec:(nil)
enqueue operation 11:downloadC queue[1] position[64] data[960]
CPU waiting on ACC async queue:1
make: *** [run1] Segmentation fault (core dumped)

Could you help me in reading and interpreting that?
I see that the second kernel is called with "__pgi_cu_alloc". But I don't understand "__pgi_cu_downloadc("b1",size=8,offset=0,lineno=80)". Line 80 correspond to the for-loop in my first kernel, but I don't have a variable called "b1". Might that be the a temporary variable because of the reduction?

Thanks. Sandra
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 6213
Location: The Portland Group Inc.

PostPosted: Wed May 30, 2012 10:47 am    Post subject: Reply with quote

Thanks Sandra, the NVDEBUG output was very helpful.

We think know what's wrong. "b1" is the struct to hold the argument list to your kernels. "cuModuleGetGlobal" is called to get the address of "b1" so the kernel can access it. In asynchronous mode there are multiple modules so we think the wrong module handle is being used.

I've created a problem report (TPR#18723) and sent on to engineering. If you can send us a reproducing example that would be great since this will allow us to confirm this is indeed the problem and allow us to verify a fix, but we understand if you can't.

Best Regards,
Mat
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 6213
Location: The Portland Group Inc.

PostPosted: Wed May 30, 2012 11:16 am    Post subject: Reply with quote

FYI, the workaround would be to disable asynchronous via setting the environment variable "ACC_SYNCHRONOUS" to 1.

- Mat
Back to top
View user's profile
Display posts from previous:   
Post new topic   Reply to topic    PGI User Forum Forum Index -> Accelerator Programming All times are GMT - 7 Hours
Goto page 1, 2  Next
Page 1 of 2

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © phpBB Group