PGI User Forum

cuda fortran questions
 
AnatoliyVolkov32648



Joined: 25 May 2012
Posts: 8

Posted: Thu Jul 26, 2012 11:45 am

Tried to recompile the code on the Quadro 5000 machine, but even after I've installed trial keys, I get:
avolkov@viz01:/data1/avolkov/src/cuda> pgf90 stream_cudafor.f -o stream_cudafor -Mcuda=cc20,cuda4.2
pgi-f95-lin64: LICENSE MANAGER PROBLEM: Failed to checkout license
AnatoliyVolkov32648



Joined: 25 May 2012
Posts: 8

Posted: Thu Jul 26, 2012 1:06 pm

I have downloaded and compiled NPB2.3 - FT Benchmark - C + CUDA
(http://hpcgpu.codeplex.com/releases/view/34770)
When compiled with nvcc, it shows exactly the same behavior as my Fortran test codes:

> nvcc c_randdp.cu ft.cu wtime.cu c_timers.cu c_print_results.cu -o cuda_ft.exe -I /usr/local/cuda/NVIDIA_GPU_Computing_SDK/C/common/inc -O3 -arch sm_13

> ldd cuda_ft.exe
linux-vdso.so.1 => (0x00007fffc9bff000)
libcudart.so.4 => /usr/local/cuda/lib64/libcudart.so.4 (0x00002ba475bb0000)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00002ba475e39000)
libm.so.6 => /lib64/libm.so.6 (0x00002ba476143000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002ba47639a000)
libc.so.6 => /lib64/libc.so.6 (0x00002ba4765b1000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002ba476941000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ba476b45000)
librt.so.1 => /lib64/librt.so.1 (0x00002ba476d63000)
/lib64/ld-linux-x86-64.so.2 (0x00002ba47598d000)

  PID USER     PR NI  VIRT   RES  SHR S %CPU %MEM   TIME+ COMMAND
19528 avolkov  20  0 76.2g  150m  12m R  100  0.2 0:03.98 cuda_ft.exe

btw, this is on a different OpenSUSE 12.1 x86_64 machine that also has a GTX 460 card:

/usr/local/cuda/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Found 1 CUDA Capable device(s)

Device 0: "GeForce GTX 460"
CUDA Driver Version / Runtime Version 4.2 / 4.2
CUDA Capability Major/Minor version number: 2.1
Total amount of global memory: 1024 MBytes (1073283072 bytes)
( 7) Multiprocessors x ( 48) CUDA Cores/MP: 336 CUDA Cores
GPU Clock rate: 1430 MHz (1.43 GHz)
Memory Clock rate: 1800 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 4 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.2, CUDA Runtime Version = 4.2, NumDevs = 1, Device = GeForce GTX 460
[deviceQuery] test results...
PASSED

> exiting in 3 seconds: 3...2...1...done!
AnatoliyVolkov32648



Joined: 25 May 2012
Posts: 8

Posted: Thu Jul 26, 2012 1:17 pm

I took the cuda_ft.exe executable built on the GTX 460 machine and ran it on the Quadro 5000 machine. Both run OpenSUSE 12.1 x86_64 but have different motherboards and CPUs. GTX 460 machine: ASUS KGPE-D16 server board, 2 x AMD Opteron 6234 processors, 64 GB DDR3 ECC RAM. Quadro 5000 machine: ASUS SABERTOOTH 990FX mainboard, AMD Phenom II X6 1090T processor, 16 GB DDR3.
Now the executable asks for 28.2 GB of virtual memory:

  PID USER     PR NI  VIRT   RES  SHR S %CPU %MEM   TIME+ COMMAND
24579 avolkov  20  0 28.2g  156m  21m R  101  1.0 0:01.85 cuda_ft.exe

UPDATE 1: All my Fortran test codes do the same: they ask for 76 GB of virtual memory on the Opteron 6234 machine, and 28.8 GB on the Phenom II machine.
The reason I could not run stream_cudafor, compiled on the Opteron 6234 machine, on the Phenom II computer is that pgf90 uses the '-tp bulldozer' flag by default there. The Opteron 6234 is indeed based on the Bulldozer architecture, but the Phenom II is a K10 CPU. Compiling with the '-tp x64' flag fixed this compatibility problem.
Unfortunately, my own code still runs very slowly in the GPU version (no matter whether it runs on the GTX 460 or the Quadro 5000), but I now think this is related to my programming rather than any other issue.
mkcolg



Joined: 30 Jun 2004
Posts: 6129
Location: The Portland Group Inc.

Posted: Fri Jul 27, 2012 8:32 am

Hi Anatoliy,

Just to be clear, you believe the issue is with your particular system running the GTX460? You're the third or fourth person that has had a non-reproducible error with these cards. The problems are all different with the only commonality being the GTX460. For the last one, I had our IT build me a system with a GTX460, but still couldn't reproduce the problem.

With the GTX line, only the GPU chip is made by NVIDIA; the rest of the card is assembled by various third parties. The Quadro and Tesla cards are made by NVIDIA itself, so they typically meet a higher quality standard, which is why NVIDIA recommends using GTX cards only for graphics.

I have no idea if a flaky card is the root cause of the problem (doubtful since it's virtual memory), poor interaction with the CUDA driver (more likely), or something else entirely (most likely).

Quote:
Unfortunately, my own code still runs very slowly in the GPU version (no matter whether it runs on the GTX 460 or the Quadro 5000), but I now think this is related to my programming rather than any other issue.
Have you profiled your code to see where the time is coming from?

Three things to look for:

- Excessive data movement.
- Not enough parallelism.
- Data access on the device.


- Mat
AnatoliyVolkov32648



Joined: 25 May 2012
Posts: 8

Posted: Fri Jul 27, 2012 2:06 pm

Hello Mat,

Honestly, I do not know what to think, because I seem to have the same issue on the Quadro 5000. Let me try to explain what has been happening with my code. I have profiled a serial version of my code with pgprof and gprof. Several subroutines account for almost 99% of the total CPU time, and the main code calls them constantly. The good news is that the work these subroutines do is parallel in nature, so I modified my code to run them on the GPU. I have profiled the CUDA-enabled executable with nvprof, and I can see that a) the calculations on the GPU are fast (seconds), and b) the memcpy traffic in and out of the GPU is also fairly quick (~0.2 sec):

Opteron 6234 + GTX 460:
********************************

> nvprof
Time(%)      Time   Calls       Avg       Min       Max  Name
  64.87     4.78s   23581  202.83us  202.53us    1.00ms  wavefundercuda
  31.75     2.34s   23582   99.27us   75.81us    2.29ms  primdercuda
   2.51  184.73ms  188685     979ns     800ns    9.76us  [CUDA memcpy HtoD]
   0.88   64.62ms   47163    1.37us    1.31us   14.94us  [CUDA memcpy DtoH]
   0.00   15.97us       1   15.97us   15.97us   15.97us  denmatcuda

> top
  PID USER     PR NI  VIRT   RES  SHR S %CPU %MEM   TIME+ COMMAND
32697 avolkov  20  0 76.1g   68m  20m R  100  0.1 0:46.17 denprop


Phenom II + Quadro 5000:
*********************************

> nvprof
Time(%)      Time   Calls       Avg       Min       Max  Name
  69.80     6.18s   23581  262.26us  261.77us    1.31ms  wavefundercuda
  27.09     2.40s   23582  101.78us   81.93us    2.21ms  primdercuda
   2.24  198.78ms  188685    1.05us     864ns    9.98us  [CUDA memcpy HtoD]
   0.86   76.53ms   47163    1.62us    1.57us   17.60us  [CUDA memcpy DtoH]
   0.00   16.10us       1   16.10us   16.10us   16.10us  denmatcuda

> top
  PID USER     PR NI  VIRT   RES  SHR S %CPU %MEM   TIME+ COMMAND
27488 avolkov  20  0 28.1g   27m  22m R  100  0.2 0:13.33 denprop

The executable and input files were the same for both runs, but there is a huge difference in memory usage (virtual: 76 GB vs 28 GB, resident: 68 MB vs 27 MB, shared: 20 MB vs 22 MB). For comparison, the serial executable uses far less memory: 17 MB virtual, 3 MB resident, and 1.4 MB shared:

  PID USER     PR NI  VIRT   RES  SHR S %CPU %MEM   TIME+ COMMAND
29900 avolkov  20  0 17212  2668 1448 R  100  0.0 0:21.92 denprop

Note that in the CUDA version, no extra arrays were allocated on the host compared to the serial version. All new arrays are allocated on device. Why such a great difference in memory usage? It does not make sense to me.
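
One way to read these numbers, offered as a rough sketch rather than a diagnosis: VIRT in top counts reserved address space, not memory actually in use, and on both machines it sits about 12 GB above the installed host RAM (64 GB and 16 GB, from the hardware description earlier in the thread). That pattern would be consistent with the CUDA runtime reserving address space for unified addressing (deviceQuery reports UVA support), though nothing in this thread confirms it. The helper names `parse_status` and `virt_minus_ram` and the sample text are illustrative assumptions; on Linux the same two fields can be read from /proc/<pid>/status.

```python
# Sketch: compare reserved address space (VIRT) with installed host RAM.
# VIRT and RES figures are copied from the top output quoted in this post;
# the RAM sizes come from the hardware description earlier in the thread.

def parse_status(text):
    """Return {'VmSize': kB, 'VmRSS': kB} from /proc/<pid>/status-style text."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("VmSize", "VmRSS"):
            fields[key] = int(rest.split()[0])  # the kernel reports kB
    return fields

# (machine, host RAM in GB, VIRT in GB, RES in MB) as reported by top
runs = [
    ("Opteron 6234 + GTX 460", 64, 76.1, 68),
    ("Phenom II + Quadro 5000", 16, 28.1, 27),
]

def virt_minus_ram(runs):
    """VIRT in excess of installed RAM, per machine, rounded to 0.1 GB."""
    return {name: round(virt - ram, 1) for name, ram, virt, _ in runs}

for name, extra in virt_minus_ram(runs).items():
    print(f"{name}: VIRT exceeds host RAM by about {extra} GB")

# Hypothetical sample in the /proc/<pid>/status format (values invented)
sample = "Name:\tdenprop\nVmSize:\t79796633 kB\nVmRSS:\t   69632 kB\n"
print(parse_status(sample))
```

If this reading is right, the resident figures (68 MB and 27 MB) are the ones that reflect real memory pressure, and a large VIRT on its own would not be expected to slow the program down.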

Also, the overall elapsed time of both CUDA-enabled runs (~3.5 min) is much greater than that of even the serial version (~40 sec). I now believe this is related to the CUDA-enabled executable requesting too much virtual memory (gigabytes). Exactly how much seems to depend on the host hardware (76 GB on the Opteron 6234 and 28 GB on the Phenom II). I do not know how the graphics card relates to all this: all my Opteron 6xxx machines have GTX 460 cards, and Quadro 5000s are only available on the Phenom II machines.
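
A quick arithmetic check on the nvprof summaries above, as a minimal sketch: it adds up the time nvprof attributes to kernels and copies and compares it with the ~3.5 min elapsed time. All figures are copied from this post; the dictionary layout and the function names `gpu_total` and `copies_per_launch` are illustrative assumptions.

```python
# Sketch: how much of the ~3.5 min elapsed time does nvprof account for?
# name -> (total time in seconds, number of calls), from the summaries above.

gtx460 = {
    "wavefundercuda":     (4.78,      23581),
    "primdercuda":        (2.34,      23582),
    "[CUDA memcpy HtoD]": (184.73e-3, 188685),
    "[CUDA memcpy DtoH]": (64.62e-3,  47163),
    "denmatcuda":         (15.97e-6,  1),
}
quadro5000 = {
    "wavefundercuda":     (6.18,      23581),
    "primdercuda":        (2.40,      23582),
    "[CUDA memcpy HtoD]": (198.78e-3, 188685),
    "[CUDA memcpy DtoH]": (76.53e-3,  47163),
    "denmatcuda":         (16.10e-6,  1),
}

ELAPSED_S = 3.5 * 60  # "~3.5 min" in the post, so only approximate

def gpu_total(profile):
    """Sum of all kernel and memcpy time reported by nvprof, in seconds."""
    return sum(seconds for seconds, _ in profile.values())

def copies_per_launch(profile, kernel="wavefundercuda"):
    """Host<->device copies issued per launch of the given kernel."""
    copies = profile["[CUDA memcpy HtoD]"][1] + profile["[CUDA memcpy DtoH]"][1]
    return copies / profile[kernel][1]

for label, prof in (("GTX 460", gtx460), ("Quadro 5000", quadro5000)):
    total = gpu_total(prof)
    print(f"{label}: {total:.2f} s on the GPU, "
          f"{100 * total / ELAPSED_S:.1f}% of elapsed, "
          f"{copies_per_launch(prof):.1f} copies per kernel launch")
```

By this count the kernels and copies add up to roughly 7.4 s and 8.9 s, under 5% of the elapsed time, so most of the run is spent somewhere this summary does not show, presumably on the host side. About ten very small copies (around 1 us each on the device) accompany every kernel launch, which may be worth checking against the "excessive data movement" item in Mat's list.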

I know it may sound strange, but do you think you could possibly try running my example on one of your machines (preferably, GTX 460 and OpenSUSE 12.1 x86_64) ? I would really like to know if there is a problem with my OpenSUSE installation, my hardware, or my programming.

Thanks,
Anatoliy
All times are GMT - 7 Hours