Interpret out-of-bounds address and recover null names

OpenACC and CUDA Fortran
Post Reply
Shine X.M. Zhai
Posts: 31
Joined: Apr 22 2019

Interpret out-of-bounds address and recover null names

Post by Shine X.M. Zhai » Wed May 15, 2019 2:21 pm

Hello,

I used cuda-memcheck to debug a seemingly memory corruption issue in a C++ and OpenACC code. The output is representative of a memory access error:

Invalid __global__ read of size 8
========= at 0x00000060 in filename+line number
========= by thread (0,0,0) in block (0,0,0)
========= Address 0x070c40b0 is out of bounds

According to the manual (https://docs.nvidia.com/cuda/cuda-memcheck/index.html), the address 0x070c40b0 should be a device address. So my first question is: is there a way to interpre/map back the address to the variable in the code?

I tried to use acc_present_dump to check whether the various variables had been allocated device memory. However, not all the variables have their names properly detected. For example:

host:0x5e6f8a0 device:0x7fa4c6a00200 size:4 presentcount:0+1 line:-1 name:iCondTableClip_
host:0x6af63e0 device:0x7fa4ab201400 size:160 presentcount:0+1 line:221 name:_T75523488_48240
host:0x6ce0200 device:0x7fa4ab200200 size:40 presentcount:0+1 line:-1 name:(null)

Here the 1st line is fine. The 2nd line is not meaningful and the 3rd line is simply null. My second question is: is there a way to retain the name information?

Thanks,
Shine

mkcolg
Posts: 8023
Joined: Jun 30 2004

Re: Interpret out-of-bounds address and recover null names

Post by mkcolg » Wed May 15, 2019 6:06 pm

Hi Shine,

"0x070c40b0" is more likely a host address not a device.
The 2nd line is not meaningful
It's a managed name for a particular "this" pointer.
3rd line is simply null
Null is used when the variable doesn't have a name such as when allocating memory for a pointer contained in an aggregate.

Since the problem is more likely an issue where a host address is getting used on the device, the present table probably wont help much.

Do you know which compute region has the issue? If not, try setting "PGI_ACC_NOTIFY=1" to have the runtime print out the kernel launches. It should narrow down which kernel is causing the error and therefore which variable is causing the problem.

Do you have an example of the problem code you could post?

Best Regards,
Mat

Shine X.M. Zhai
Posts: 31
Joined: Apr 22 2019

Re: Interpret out-of-bounds address and recover null names

Post by Shine X.M. Zhai » Wed May 15, 2019 6:34 pm

Hi Mat,

Thanks for your reply! Could you elaborate a bit more on how I could distinguish a host/device address in the future? The "0x070c40b0" has 10 digits, while addresses printed by acc_present_dump (such as host:0x5e6f8a0 device:0x7fa4c6a00200 size:4 presentcount:0+1 line:-1 name:iCondTableClip_) show that host address has 9 digits while device address has 14 digits. Is the number of digits relevant? (this is question 1)

As to your question: yes, I know which compute region where the problem occurred, and the "-ta=nvidia:lineinfo" compiler flag told me which variable should be to blame. However, I think I had correctly allocated device memory for the host pointer, because information from acc_present_dump did show memory was allocated on device. One thing though: the bytes information (by acc_present_dump) does not perfectly agree with the various data structures in my class (but somewhat close). I suppose this is due to data alignment, is that possible? (this is question 2)

I am now inclined to think that perhaps my class instance was not properly exposed (or made 'present') to the GPU, and as a result the pointers in the class are only shallow copies (even though I provided mechanisms for manual deep copy). "-Minfo=accel" shows that right before the parallel region, there is an implicit copyin (this), which probably should not happen? (this is question 3)

The problem code is quite big, and has heavy inheritance. I will have to re-produce a simpler code to post. I actually have some further questions regarding outputs from acc_present_dump, but for your easy reply I will postpone asking them for now.

Thanks,
Shine

mkcolg
Posts: 8023
Joined: Jun 30 2004

Re: Interpret out-of-bounds address and recover null names

Post by mkcolg » Thu May 16, 2019 9:04 am

The "0x070c40b0" has 10 digits, while addresses printed by acc_present_dump (such as host:0x5e6f8a0 device:0x7fa4c6a00200
The first address is 8-bytes (32-bits) so would be a host address (the small memory model, which is default, uses 32-bit offset from a base pointer address for addressing) while the device address is 12-bytes (48-bits). CUDA device pointers are 48-bits. The starting "0x" just indicates that the format is using hexadecimal notation.
One thing though: the bytes information (by acc_present_dump) does not perfectly agree with the various data structures in my class (but somewhat close). I suppose this is due to data alignment, is that possible? (this is question 2)
How are you determining the host address to the class? They should match to what you'd get by printing the address of the variable. i.e. "printf("VAR ADDR=%p\n",&var)".
"-Minfo=accel" shows that right before the parallel region, there is an implicit copyin (this), which probably should not happen? (this is question 3)
It's a possible issue. By default, the compiler will include an implicit copy for variables in a compute region if the compute region is not enclosed in a structured data region. However, "present_or" semantics are used so at runtime the address is checked against the present table. If it's there, then the device copy is used. Otherwise the variable is copied.

You might consider adding a "present" clause on your compute regions. This will have the runtime use the address found in the present table and abort if it's not there.

It's possible that the class object used in the routine is a reference so may not have the same host address as the original variable.
I will have to re-produce a simpler code to post.
That would be great. Having an example is very helpful, otherwise, I'm just making educated guesses as the problem.
I actually have some further questions regarding outputs from acc_present_dump, but for your easy reply I will postpone asking them for now.
No worries though feel free to ask and I'll do my best to help.

-Mat

Post Reply