alfvenwave
Joined: 08 Apr 2010 Posts: 77
|
Posted: Fri Apr 09, 2010 7:42 am Post subject: CUDA + OpenMP oddity - looks like a compiler bug. |
|
|
I have a very strange problem. I am running the code below (omptst.cuf) with:
export OMP_NUM_THREADS=3
I have 3 NVIDIA graphics cards, so I am attaching one card to each openMP thread.
I am compiling the code with:
pgfortran omptst.cuf -mp
Now, if I run the code as it stands, the code hangs - it tells me that all three cards have been initialized - but it just sits there.
However, if I comment out the call to curk4 with argument Fdev (see my comments in the code), the code finishes almost instantaneously, as it should, since it doesn't actually do anything. Notice, though, that there is a return statement before this call, so commenting out the call should make no difference at all!
Anyone got any idea what's going on? Is this me, or a compiler bug?
Rob.
| Code: |
module curk4_mod
  use cudafor
  implicit none
contains
  ! Kernel subroutines:
  subroutine curk4( Fdev )
    use prec_mod
    implicit none
    real( gpu ), device, intent(in) :: Fdev(2)
    print*,'Dont even bother to call a kernel function....'
  end subroutine curk4
  ! OMP wrapper:
  subroutine omptst( F )
    use prec_mod
    use cudafor
    implicit none
    real (gpu) :: F (2)
    real (gpu), device :: Fdev(2)
    integer :: iflag,idev
    return
    !-----------------------
    ! If I comment out this next line, the code finishes.
    ! If I leave it in, the code hangs - even though there is a return
    ! statement above!
    !-----------------------
    call curk4( Fdev )
  end subroutine
end module curk4_mod
program wrapper
  use cudafor
  use prec_mod
  use curk4_mod
  implicit none
  integer :: i,j
  integer :: numDev, iflag
  real :: F(2),F2(2)
  !$OMP PARALLEL PRIVATE(i,F2,iflag) SHARED(F)
  !$OMP DO
  do i=0,2
    iflag = cudaSetDevice(i)
    print*,'Device ',i,' set'
    F2 = F
    call omptst( F2 )
  enddo
  !$OMP END DO
  !$OMP END PARALLEL
  iflag = cudaThreadSynchronize()
  print*,'Finished.'
end |
|
|
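For reference, a minimal sketch of the one-card-per-thread binding described in this post, written so that each thread picks its device from its own thread number rather than from a loop index. It assumes device numbers run from 0 to numDev-1 and that the OpenMP runtime grants the requested number of threads. The program name bind_devices is hypothetical and is not part of the original omptst.cuf.

```fortran
! Sketch only: bind one GPU to each OpenMP thread by thread number.
! Assumed build line, mirroring the thread: pgfortran bind_devices.cuf -mp
program bind_devices
  use cudafor
  use omp_lib
  implicit none
  integer :: numDev, istat, tid, idev

  ! Ask the CUDA runtime how many devices it can see.
  numDev = 0
  istat = cudaGetDeviceCount(numDev)
  if (istat /= cudaSuccess) then
    print *, 'cudaGetDeviceCount failed: ', trim(cudaGetErrorString(istat))
    stop
  end if
  if (numDev < 1) then
    print *, 'No CUDA devices found.'
    stop
  end if

  ! Request one thread per device; each thread uses its own number as the device id.
  !$omp parallel private(tid, idev, istat) num_threads(numDev)
  tid  = omp_get_thread_num()
  idev = tid
  istat = cudaSetDevice(idev)
  if (istat /= cudaSuccess) then
    print *, 'Thread ', tid, ': cudaSetDevice failed: ', trim(cudaGetErrorString(istat))
  else
    print *, 'Thread ', tid, ' bound to device ', idev
  end if
  !$omp end parallel

  print *, 'Finished.'
end program bind_devices
```

Because the device number comes from omp_get_thread_num(), the mapping does not depend on how the iterations of an !$OMP DO loop happen to be scheduled across threads.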
mkcolg
Joined: 30 Jun 2004 Posts: 4996 Location: The Portland Group Inc.
|
Posted: Fri Apr 09, 2010 10:15 am Post subject: |
|
|
Hi Rob,
I tried it on my system and it worked fine.
| Code: | % pgf90 curk4.cuf -mp
% setenv OMP_NUM_THREADS 3
% a.out
Device 0 set
Device 2 set
Device 1 set
Finished.
|
Granted, my system has 4 cards, but that shouldn't matter.
Does the code work without "-mp"? When "OMP_NUM_THREADS" is set to 1? What devices do you have (see output from pgaccelinfo).
- Mat |
|
alfvenwave
Joined: 08 Apr 2010 Posts: 77
|
Posted: Fri Apr 09, 2010 12:15 pm Post subject: Interesting - because that doesn't look right either. |
|
|
Hi Mat.
In answer to your question about my hardware.....
I have 3 C1060s installed + one Quadro display card. They appear in the following order as assigned by pgfortran:
Device 0 is a Tesla C1060 card
Device 1 is a Quadro FX 580 card
Device 2 is a Tesla C1060 card
Device 3 is a Tesla C1060 card
The machine is part of a cluster (w/o display) so all 4 cards should be available for number crunching.
What is a little odd is that the Quadro card is coming up as the second device. When we first installed our machine, we made the mistake of using the latest CUDA v3.0 beta - pgf refused to play nicely with this version of the CUDA SDK. The difference, though, was that back then the Quadro card came up as device number 3. After much messing around, we installed everything from scratch with CUDA v2.3, and from that point on the Quadro appeared as device #1. For single-card algorithms, all 4 cards seem to work fine. My problem, though, as you know, is in trying to use more than one card at once.
I have noticed something even more odd now. If I set OMP_NUM_THREADS=1 and run the code, it runs fine if the call to curk4 is "uncommented", but hangs if I comment it out - i.e. the opposite way around.
If I compile without the -mp option, I get the same behaviour as I do if I compile with -mp but set OMP_NUM_THREADS=1.
Now, I have just noticed something that might be highlighting the problem. Everything works fine if I set OMP_NUM_THREADS=2 and loop from i=0,1 in wrapper.cuf. This is presumably only deploying devices 0 and 1 (i.e. one C1060 and the Quadro). It's almost as if the Quadro card is blocking communication to devices 2 and 3.
I'm completely puzzled though by how changing the code "after" the return statement is making a difference.
So what do you reckon - is our openMP setup to blame? Or CUDA, or PGF? Or maybe our hardware? Do you think we should pull the Quadro card out to see if that's to blame?
Any help greatly appreciated (I'll acknowledge you in any papers that come out of this work if you can help me find the answer)....
Have a great weekend,
Rob. |
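A minimal sketch of how the device ordering described above could be checked from inside a program, so that only the compute cards are selected whatever slot the Quadro lands in. Matching on the substring 'Tesla' in the device name is an assumption made for this illustration, and the program name list_devices is hypothetical.

```fortran
! Sketch only: enumerate devices and flag the ones whose name contains 'Tesla'.
program list_devices
  use cudafor
  implicit none
  type(cudaDeviceProp) :: prop
  integer :: numDev, istat, i

  numDev = 0
  istat = cudaGetDeviceCount(numDev)
  if (istat /= cudaSuccess) then
    print *, 'cudaGetDeviceCount failed: ', trim(cudaGetErrorString(istat))
    stop
  end if

  do i = 0, numDev - 1
    ! Query the properties of device i without making it the current device.
    istat = cudaGetDeviceProperties(prop, i)
    if (istat /= cudaSuccess) then
      print *, 'Device ', i, ': query failed: ', trim(cudaGetErrorString(istat))
      cycle
    end if
    ! Assumption: compute cards are identified by 'Tesla' in the device name.
    if (index(prop%name, 'Tesla') > 0) then
      print *, 'Device ', i, ' is a compute card: ', trim(prop%name)
    else
      print *, 'Device ', i, ' skipped: ', trim(prop%name)
    end if
  end do
end program list_devices
```

The device numbers it reports as compute cards could then be passed to cudaSetDevice in place of a plain 0, 1, 2 loop index.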
|
alfvenwave
Joined: 08 Apr 2010 Posts: 77
|
Posted: Fri Apr 09, 2010 12:25 pm Post subject: |
|
|
Have just done a dump using pgaccelinfo. Something odd though - the utility hangs half way down the output from device #2 (at the point where it's supposed to write out "Current Free Memory"). This is now leading me more and more to believe it's a hardware problem. Here is the dump as far as it gets....
Rob.
| Code: | CUDA Driver Version 2030
Device Number: 0
Device Name: Tesla C1060
Device Revision Number: 1.3
Global Memory Size: 4294705152
Number of Multiprocessors: 30
Number of Cores: 240
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 16384
Registers per Block: 16384
Warp Size: 32
Maximum Threads per Block: 512
Maximum Block Dimensions: 512 x 512 x 64
Maximum Grid Dimensions: 65535 x 65535 x 1
Maximum Memory Pitch: 262144B
Texture Alignment 256B
Clock Rate: 1296 MHz
Initialization time: 10974 microseconds
Current free memory 4246142976
Upload time (4MB) 1163 microseconds ( 715 ms pinned)
Download time 1493 microseconds (1252 ms pinned)
Upload bandwidth 3606 MB/sec (5866 MB/sec pinned)
Download bandwidth 2809 MB/sec (3350 MB/sec pinned)
Device Number: 1
Device Name: Quadro FX 580
Device Revision Number: 1.1
Global Memory Size: 536150016
Number of Multiprocessors: 4
Number of Cores: 32
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 16384
Registers per Block: 8192
Warp Size: 32
Maximum Threads per Block: 512
Maximum Block Dimensions: 512 x 512 x 64
Maximum Grid Dimensions: 65535 x 65535 x 1
Maximum Memory Pitch: 262144B
Texture Alignment 256B
Clock Rate: 1125 MHz
Initialization time: 10974 microseconds
Current free memory 491466752
Upload time (4MB) 1194 microseconds ( 854 ms pinned)
Download time 2404 microseconds (2164 ms pinned)
Upload bandwidth 3512 MB/sec (4911 MB/sec pinned)
Download bandwidth 1744 MB/sec (1938 MB/sec pinned)
Device Number: 2
Device Name: Tesla C1060
Device Revision Number: 1.3
Global Memory Size: 4294705152
Number of Multiprocessors: 30
Number of Cores: 240
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 16384
Registers per Block: 16384
Warp Size: 32
Maximum Threads per Block: 512
Maximum Block Dimensions: 512 x 512 x 64
Maximum Grid Dimensions: 65535 x 65535 x 1
Maximum Memory Pitch: 262144B
Texture Alignment 256B
Clock Rate: 1296 MHz
Initialization time: 10974 microseconds
|
|
|
mkcolg
Joined: 30 Jun 2004 Posts: 4996 Location: The Portland Group Inc.
|
Posted: Fri Apr 09, 2010 1:34 pm Post subject: |
|
|
Hi Rob,
I don't have any great insight here but it does seem to be a hardware issue.
I talked with one of our IT people. The only times he's seen this type of behavior are when there's not enough power to the device, another process is already running on the device, or the device is hung up. I would try the following:
1) Reboot the system (to see if a device is hung)
2) Set your program to use just device 2 or 3 (can you even access the cards?)
3) Swap device 0 and 2 (is device 2 just bad?)
4) Move the Quadro to the last socket (is the order the problem?)
5) Remove all but one card, then add them back one at a time (is power the problem?)
- Mat |
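One way to try the "use just device 2 or 3" step is a small single-card probe, sketched below. It takes the device number as its first command-line argument (defaulting to 0), copies four values to the card and back, and compares them. The program name probe_device and the four-element test array are hypothetical choices for this illustration.

```fortran
! Sketch only: exercise a single device with a small host/device round trip.
program probe_device
  use cudafor
  implicit none
  integer :: idev, istat
  character(len=16) :: arg
  real :: h_in(4), h_out(4)
  real, device, allocatable :: d(:)

  ! Device number from the first command-line argument; defaults to 0.
  idev = 0
  if (command_argument_count() >= 1) then
    call get_command_argument(1, arg)
    read (arg, *) idev
  end if

  istat = cudaSetDevice(idev)
  if (istat /= cudaSuccess) then
    print *, 'cudaSetDevice failed: ', trim(cudaGetErrorString(istat))
    stop
  end if

  h_in  = (/ 1.0, 2.0, 3.0, 4.0 /)
  h_out = 0.0
  allocate(d(4))
  d     = h_in   ! host-to-device copy
  h_out = d      ! device-to-host copy
  deallocate(d)

  if (all(h_out == h_in)) then
    print *, 'Device ', idev, ' round trip OK'
  else
    print *, 'Device ', idev, ' returned wrong data'
  end if
end program probe_device
```

If a card is hung, this probe would be expected to hang as well when pointed at it, which is itself the diagnostic: running it once per device number shows which card stops responding.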
|