PGI User Forum

Core binding on more cores than found at compile time

 
njustn



Joined: 09 Nov 2011
Posts: 22

Posted: Sun Dec 16, 2012 7:29 am    Post subject: Core binding on more cores than found at compile time

Hi,

I'm running into a problem that feels like it must be user error, but I can't work out how to fix it. If I compile a code containing OpenMP pragmas on a system with 8 cores, it runs fine locally and on every system I have tried with 8 or fewer cores, whether or not I use MP_BIND (compiling with -mp=allcores or -mp=bind both work). When I take a binary compiled on that system, run it on a system with 12 cores, and set it to bind, I get the following output and then the program crashes:

Code:

mbind: Bad address
mbind: Bad address
mbind: Bad address
mbind: Bad address


On a system with 24 cores, again crashing:
Code:

mbind: Bad address
mbind: Bad address
mbind: Bad address
mbind: Bad address
mbind: Bad address
mbind: Bad address
mbind: Bad address
mbind: Bad address
mbind: Bad address
mbind: Bad address
mbind: Bad address
mbind: Bad address
mbind: Bad address
mbind: Bad address
mbind: Bad address
mbind: Bad address
mbind: Bad address
mbind: Bad address
mbind: Bad address
mbind: Bad address


It is always the number of cores minus 8. It doesn't seem to matter whether I specify MP_BLIST or any of the related settings: if binding is turned on, I get that message once for every core beyond the number of cores on the compiling machine. Is this expected behavior? If so, is there some way to set a core count higher than that of the compiling machine so I can make effective use of the others?
mkcolg



Joined: 30 Jun 2004
Posts: 6134
Location: The Portland Group Inc.

Posted: Mon Dec 17, 2012 11:07 am    Post subject:

Hi njustn,

This is a new one, so I'm not sure what's wrong. Nothing in how we bind is tied to the compilation system; it's all determined at runtime, so this behaviour is odd.

The error itself is coming out of the NUMA library, so I'm wondering if the problem lies there. First, can you run "ldd my.exe" and see which libnuma is being picked up? Also, did you link statically (-Bstatic)? Finally, what happens if you don't use libnuma (-mp=nonuma)?

- Mat
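
For readers decoding the message itself: its "name: text" shape is what perror() prints, and on Linux "Bad address" is the text for errno EFAULT, which the mbind(2) man page documents for an inaccessible node mask or an unmapped hole in the address range. Below is a minimal sketch that rebuilds the string. The helper name format_mbind_error is hypothetical, and it is only an assumption that libnuma prints its message this way.

```cpp
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build the text that perror("mbind") would print
   for a given errno value. Returns the snprintf result. */
int format_mbind_error(int err, char *buf, size_t len)
{
    /* perror prints "<prefix>: <strerror(errno)>" */
    return snprintf(buf, len, "mbind: %s", strerror(err));
}
```

Passing EFAULT reproduces the line seen in the output above, which points at the arguments handed to mbind rather than at the core count as such.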
njustn



Joined: 09 Nov 2011
Posts: 22

Posted: Mon Dec 17, 2012 3:41 pm    Post subject:

Hi Mat,

I did not link statically, so it's using the libnuma on the other machine. It may be a different version, but then I would think it would fail on all the cores rather than cores minus 8...

ldd output:
Code:

        linux-vdso.so.1 =>  (0x00007fff967ff000)
        libcuda.so.1 => /usr/lib/libcuda.so.1 (0x00007fef98774000)
        libcudart.so.4 => /usr/local/cuda/lib64/libcudart.so.4 (0x00007fef9851b000)
        libm.so.6 => /lib/libm.so.6 (0x00007fef98299000)
        libdl.so.2 => /lib/libdl.so.2 (0x00007fef98095000)
        libcolamd.so.2.7.1 => /usr/lib/libcolamd.so.2.7.1 (0x00007fef97e8d000)
        libnuma.so => /usr/lib/libnuma.so (0x00007fef97c85000)
        libpthread.so.0 => /lib/libpthread.so.0 (0x00007fef97a69000)
        libc.so.6 => /lib/libc.so.6 (0x00007fef97706000)
        libz.so.1 => /usr/lib/libz.so.1 (0x00007fef974ef000)
        librt.so.1 => /lib/librt.so.1 (0x00007fef972e7000)
        libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007fef96fd2000)
        libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00007fef96dbc000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fef99368000)



Compiling with -mp=nonuma does allow the program to run whether I set bind or not, does it bind the threads in that case though? I had been using a workaround and just manually binding them with CPU_SET, so I know you can without libnuma, but does it?
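
The manual CPU_SET workaround mentioned here was not posted, so the following is only a sketch of what such binding might look like on Linux with glibc, without libnuma. The function names are hypothetical. It uses sched_setaffinity with pid 0, which applies to the calling thread, so each OpenMP thread would call it for itself.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* exposes CPU_SET and sched_setaffinity in glibc */
#endif
#include <errno.h>
#include <sched.h>

/* Count the logical CPUs the calling thread may run on; -1 on error. */
int allowed_cpu_count(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    return CPU_COUNT(&set);
}

/* Lowest-numbered CPU in the calling thread's affinity mask; -1 if none. */
int first_allowed_cpu(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}

/* Pin the calling thread to a single CPU.
   Returns 0 on success, -1 on failure with errno set. */
int bind_self_to_cpu(int cpu)
{
    cpu_set_t set;
    if (cpu < 0 || cpu >= CPU_SETSIZE) {
        errno = EINVAL;
        return -1;
    }
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}
```

Inside a parallel region, each thread could call bind_self_to_cpu with a CPU chosen from its thread number. That mapping assumes the CPU ids are contiguous and all allowed, which may not hold on every system.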
mkcolg



Joined: 30 Jun 2004
Posts: 6134
Location: The Portland Group Inc.

Posted: Mon Dec 17, 2012 5:09 pm    Post subject:

Quote:
libnuma.so => /usr/lib/libnuma.so (0x00007fef97c85000)
Ok, it's using the system's libnuma and not our dummy version.

Quote:
Compiling with -mp=nonuma does allow the program to run whether I set bind or not, does it bind the threads in that case though?
No, it's not binding in this case. It does tell us, though, that the problem is with libnuma (or with how the PGI runtime is interacting with it).

Quote:
I had been using a workaround and just manually binding them with CPU_SET, so I know you can without libnuma, but does it?
I personally use the 'numactl' or 'taskset' utilities, but have not used CPU_SET.

Are you able to determine the libnuma.so version? Which OS is each system running?

Can you compile (with binding, i.e. just -mp) and run on the 12 core system and see if the issue still occurs?

Thanks,
Mat
rgr



Joined: 16 Jan 2013
Posts: 1

Posted: Wed Jan 16, 2013 3:48 am    Post subject:

Hi,

I had mysterious "mbind: Bad address" errors when I did not call numa_available() before anything else, as the man page instructs.

This occurred only on Intel machines, including the one I used for compiling.
Running the binary on an AMD machine worked fine without calling numa_available().

- Robert
All times are GMT - 7 Hours