PGI User Forum

CUDA-x86.

cuda fortran questions
AnatoliyVolkov32648



Joined: 25 May 2012
Posts: 8

Posted: Thu Jul 26, 2012 6:32 am    Post subject: cuda fortran questions

Greetings,

I am very new to CUDA Fortran and have been struggling to get my own (F77/F90) code to work with CUDA. The program does run, and even gives correct results, but it is very slow. I used nvprof to see how the program is doing on the GPU, and the results were very encouraging:

Time(%)      Time   Calls       Avg       Min       Max  Name
  63.54     4.60s   23581  195.01us  194.66us    1.13ms  wavefundercuda
  32.97     2.39s   23582  101.20us   76.94us    1.21ms  primdercuda
   2.58  186.70ms  188685     989ns     800ns   11.04us  [CUDA memcpy HtoD]
   0.91   65.85ms   47163    1.40us    1.31us   22.08us  [CUDA memcpy DtoH]
   0.00   17.37us       1   17.37us   17.37us   17.37us  denmatcuda
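
The rows of the profile above can be totaled to see how much of the wall-clock time the GPU actually accounts for. The following is a minimal sketch in Python, not part of the original post: the numbers are transcribed from the nvprof table, and `ELAPSED_LOWER_BOUND_S` is an assumption that reads "over 3 minutes" as at least 180 seconds.

```python
# Rows transcribed from the nvprof table above: (name, total seconds, calls).
PROFILE = [
    ("wavefundercuda", 4.60, 23581),
    ("primdercuda", 2.39, 23582),
    ("[CUDA memcpy HtoD]", 186.70e-3, 188685),
    ("[CUDA memcpy DtoH]", 65.85e-3, 47163),
    ("denmatcuda", 17.37e-6, 1),
]

# Assumption: "over 3 minutes" of elapsed time is treated as a 180 s lower bound.
ELAPSED_LOWER_BOUND_S = 180.0


def is_copy(name):
    """nvprof labels transfers '[CUDA memcpy ...]'; every other row is a kernel."""
    return name.startswith("[CUDA memcpy")


def summarize(rows):
    """Return (kernel seconds, copy seconds, number of copy calls)."""
    kernel_s = sum(t for name, t, _ in rows if not is_copy(name))
    copy_s = sum(t for name, t, _ in rows if is_copy(name))
    n_copies = sum(calls for name, _, calls in rows if is_copy(name))
    return kernel_s, copy_s, n_copies


kernel_s, copy_s, n_copies = summarize(PROFILE)
gpu_total_s = kernel_s + copy_s
unaccounted_s = ELAPSED_LOWER_BOUND_S - gpu_total_s

print(f"kernel time:          {kernel_s:.2f} s")
print(f"copy time:            {copy_s:.3f} s in {n_copies} transfers")
print(f"profiled GPU total:   {gpu_total_s:.2f} s")
print(f"not in the profile:   at least {unaccounted_s:.0f} s")
```

Under these assumptions, nearly all of the elapsed time falls outside the kernels and copies that nvprof lists, so the slowdown has to be looked for on the host side or in per-call overhead rather than in the kernels themselves.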

It seems the program spent just under 7 seconds computing on the GPU. These subroutines account for over 90% of the total runtime (I have profiled my serial version with both pgprof and gprof). Overall, however, the elapsed time was over 3 minutes(!), whereas the serial version takes 43 seconds. I could not figure out what was going on until I looked at the program with 'top':
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6459 avolkov 20 0 76.1g 51m 20m R 100 0.1 0:05.84 denprop
That is a whopping 76 GB of virtual memory; no wonder it runs slowly. The serial version, of course, does not use anything close to even a gigabyte of RAM:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7439 avolkov 20 0 14856 2300 1156 R 100 0.0 0:03.56 denprop

Huge difference!
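
When comparing the two `top` lines above, it helps to keep VIRT and RES apart: VIRT is the address space the process has mapped or reserved, while RES is the physical memory actually resident. The sketch below, in Python and not from the original post, converts the quoted fields to bytes. The helper name `top_mem_to_bytes` is hypothetical; it relies on procps `top` reporting KiB when a field has no suffix and scaling `m`, `g`, and `t` by powers of 1024.

```python
def top_mem_to_bytes(field):
    """Convert a top memory field such as '76.1g', '51m', or '14856' to bytes.

    procps top prints KiB when there is no suffix; m, g, and t are
    successive multiples of 1024.
    """
    kib_per_unit = {"k": 1, "m": 1024, "g": 1024 ** 2, "t": 1024 ** 3}
    text = field.strip().lower()
    if text[-1] in kib_per_unit:
        kib = float(text[:-1]) * kib_per_unit[text[-1]]
    else:
        kib = float(text)
    return int(kib * 1024)


# Values copied from the two top lines quoted above.
cuda_run = {"VIRT": "76.1g", "RES": "51m"}
serial_run = {"VIRT": "14856", "RES": "2300"}

for label, run in (("CUDA", cuda_run), ("serial", serial_run)):
    virt = top_mem_to_bytes(run["VIRT"])
    res = top_mem_to_bytes(run["RES"])
    print(f"{label:6s} VIRT = {virt / 2**20:10.1f} MiB, "
          f"RES = {res / 2**20:6.1f} MiB, VIRT/RES = {virt / res:8.1f}")
```

The CUDA run has only about 51 MiB resident, so the 76 GB figure is reserved address space, not memory in use. A possible explanation, which this page does not confirm, is that the CUDA runtime reserves a large address range up front on devices that report Unified Addressing; if so, the VIRT column alone would not explain the slowdown.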

The serial version was compiled using the following flags:
pgf90 -Mextend -Mbackslash -fast -Minfo=ccff
while for CUDA I used:
pgf90 -Mextend -Mbackslash -fast -Minfo=ccff -DUSE_CUDA -lstdc++ -Mcuda

I thought there were issues with my CUDA code (and there probably are many!), but then I compiled and ran stream_cudafor.cuf (renamed to stream_cudafor.f, with ntimes changed to 100):
pgfortran stream_cudafor.f -o stream_cudafor -lstdc++ -Mcuda
and got the same virtual memory issue:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7834 avolkov 20 0 76.3g 325m 18m R 100 0.5 0:01.77 stream_cudafor

Compiling the original version with
pgf90 -Mfixed -O2 stream_cudafor.cuf -o stream_cudafor -lstdc++
gives the same memory issue.

Clearly, I am doing something wrong when compiling cuda programs, or there is something wrong with my installation of PGI and GCC compilers...

I use OpenSUSE 12.1 x86_64, kernel 3.1.10-1.16-desktop

> pgf90 --version

pgf90 12.5-0 64-bit target on x86-64 Linux -tp bulldozer
Copyright 1989-2000, The Portland Group, Inc. All Rights Reserved.
Copyright 2000-2012, STMicroelectronics, Inc. All Rights Reserved.

By default, OpenSUSE comes with the GCC 4.6.2 compiler, which does not seem to be compatible with CUDA Fortran:

> pgf90 stream_cudafor.f -o stream_cudafor -lstdc++ -Mcuda=cc20
In file included from /usr/local/pgi/linux86-64/2012/cuda/4.0/include/cuda_runtime.h:59:0,
from /tmp/pgcudaforar_baPUMKUle.gpu:1:
/usr/local/pgi/linux86-64/2012/cuda/4.0/include/host_config.h:82:2: error: #error -- unsupported GNU version! gcc 4.5 and up are not supported!
PGF90-F-0000-Internal compiler error. Device compiler exited with error status code 0 (stream_cudafor.f: 167)
PGF90/x86-64 Linux 12.5-0: compilation aborted

I had to download, compile, and install gcc 4.4.7.

I have to add the -lstdc++ switch because without it I get:
> pgf90 stream_cudafor.f -o stream_cudafor -Mcuda
/usr/bin/ld: /usr/local/pgi/linux86-64/12.5/lib/libcudafor4.a(pgi_memset.o): undefined reference to symbol '__gxx_personality_v0@@CXXABI_1.3'
/usr/bin/ld: note: '__gxx_personality_v0@@CXXABI_1.3' is defined in DSO /usr/local/gcc-4.4.7/lib64/libstdc++.so.6 so try adding it to the linker command line
/usr/local/gcc-4.4.7/lib64/libstdc++.so.6: could not read symbols: Invalid operation

I have a GTX 460 graphics card:

avolkov@wizard:/usr/local/cuda4.2/NVIDIA_GPU_Computing_SDK/C/bin/linux/release> ./deviceQuery
[deviceQuery] starting...

./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Found 1 CUDA Capable device(s)

Device 0: "GeForce GTX 460"
CUDA Driver Version / Runtime Version 5.0 / 4.2
CUDA Capability Major/Minor version number: 2.1
Total amount of global memory: 1024 MBytes (1073283072 bytes)
( 7) Multiprocessors x ( 48) CUDA Cores/MP: 336 CUDA Cores
GPU Clock rate: 1430 MHz (1.43 GHz)
Memory Clock rate: 1800 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 4 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.0, CUDA Runtime Version = 4.2, NumDevs = 1, Device = GeForce GTX 460

What am I doing wrong? Any help is greatly appreciated.

Thank you,
Anatoliy
mkcolg



Joined: 30 Jun 2004
Posts: 5815
Location: The Portland Group Inc.

Posted: Thu Jul 26, 2012 8:44 am

Hi Anatoliy,

Let's start with the easy one:
Quote:

> pgf90 stream_cudafor.f -o stream_cudafor -lstdc++ -Mcuda=cc20
In file included from /usr/local/pgi/linux86-64/2012/cuda/4.0/include/cuda_runtime.h:59:0,
from /tmp/pgcudaforar_baPUMKUle.gpu:1:
/usr/local/pgi/linux86-64/2012/cuda/4.0/include/host_config.h:82:2: error: #error -- unsupported GNU version! gcc 4.5 and up are not supported!
By default, release 12.5 uses CUDA 4.0, which doesn't support gcc 4.5 or later. The workaround is to use CUDA 4.1 by adding the "4.1" or "cuda4.1" sub-option to -Mcuda.
Code:
pgf90 stream_cudafor.f -o stream_cudafor -lstdc++ -Mcuda=cc20,cuda4.1


As for the memory problem, I'm not able to recreate the issue on my GTX460 system. My openSUSE 12.1 system has a GTX690 card, but again I don't see the issue. My best guess is that it has to do with the downgrading of the GCC version and the use of the stdc++ library (I didn't try recreating this).

If you have access to another system, I'd like to know whether you see the same problem there. Also, can you revert to using GCC 4.6 and add the CUDA 4.1 flag? Finally, the just-released PGI 12.6 version defaults to CUDA 4.2 and is another thing to try.

Hope this helps,
Mat
AnatoliyVolkov32648



Joined: 25 May 2012
Posts: 8

Posted: Thu Jul 26, 2012 9:43 am

Hello Mat,

Many thanks for your prompt reply.

When using the -Mcuda=cuda4.0 option, I get:

avolkov@wizard:/data1/avolkov/src/cuda> pgf90 stream_cudafor.f -o stream_cudafor -Mcuda=cuda4.0
In file included from /usr/local/pgi/linux86-64/2012/cuda/4.0/include/cuda_runtime.h:59:0,
from /tmp/pgcudaforn7VfNlhJr3fJ.gpu:1:
/usr/local/pgi/linux86-64/2012/cuda/4.0/include/host_config.h:82:2: error: #error -- unsupported GNU version! gcc 4.5 and up are not supported!
PGF90-F-0000-Internal compiler error. Device compiler exited with error status code 0 (stream_cudafor.f: 167)
PGF90/x86-64 Linux 12.5-0: compilation aborted

When using the -Mcuda=cuda4.1 option, I get:

avolkov@wizard:/data1/avolkov/src/cuda> pgf90 stream_cudafor.f -o stream_cudafor -Mcuda=cuda4.1
In file included from /usr/local/pgi/linux86-64/2012/cuda/4.1/include/cuda_runtime.h:59:0,
from /tmp/pgcudaforZoXfzC0dVaTr.gpu:1:
/usr/local/pgi/linux86-64/2012/cuda/4.1/include/host_config.h:82:2: error: #error -- unsupported GNU version! gcc 4.6 and up are not supported!
PGF90-F-0000-Internal compiler error. Device compiler exited with error status code 0 (stream_cudafor.f: 167)
PGF90/x86-64 Linux 12.5-0: compilation aborted

I guess the only option left is to try the newest release, 12.6.
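
The two host_config.h errors above give the first GCC version each toolkit rejects: 4.5 for CUDA 4.0 and 4.6 for CUDA 4.1. A minimal Python sketch of that check follows; it is only an illustration, the names `FIRST_UNSUPPORTED_GCC` and `gcc_supported` are hypothetical, and the table covers only the two toolkit versions whose limits are quoted in this thread.

```python
# First GCC version rejected by each CUDA toolkit, taken from the
# "#error -- unsupported GNU version!" messages quoted in this thread.
# Toolkits whose limits the thread does not show are deliberately left out.
FIRST_UNSUPPORTED_GCC = {
    "4.0": (4, 5),  # "gcc 4.5 and up are not supported!"
    "4.1": (4, 6),  # "gcc 4.6 and up are not supported!"
}


def parse_version(text):
    """Turn '4.6.2' into (4, 6, 2)."""
    return tuple(int(part) for part in text.split("."))


def gcc_supported(cuda_version, gcc_version):
    """True if the major.minor of gcc_version is below the toolkit's limit."""
    limit = FIRST_UNSUPPORTED_GCC[cuda_version]
    return parse_version(gcc_version)[:2] < limit


for cuda in ("4.0", "4.1"):
    for gcc in ("4.4.7", "4.6.2"):
        verdict = "ok" if gcc_supported(cuda, gcc) else "rejected"
        print(f"CUDA {cuda} with gcc {gcc}: {verdict}")
```

This matches what the posts show: the system gcc 4.6.2 is rejected by both the cuda4.0 and cuda4.1 sub-options, while the hand-built gcc 4.4.7 is below both limits.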
AnatoliyVolkov32648



Joined: 25 May 2012
Posts: 8

Posted: Thu Jul 26, 2012 11:13 am

Hello Mat,

I have upgraded my installation of PGI compiler suite to version 12.6.

avolkov@wizard:/data1/avolkov/src/cuda> pgf90 --version

pgf90 12.6-0 64-bit target on x86-64 Linux -tp bulldozer
Copyright 1989-2000, The Portland Group, Inc. All Rights Reserved.
Copyright 2000-2012, STMicroelectronics, Inc. All Rights Reserved.

I can now compile and link stream_cudafor without any issues:

avolkov@wizard:/data1/avolkov/src/cuda> pgf90 stream_cudafor.f -o stream_cudafor -Mcuda=cc20,cuda4.2

avolkov@wizard:/data1/avolkov/src/cuda> ldd stream_cudafor
linux-vdso.so.1 => (0x00007fff225ff000)
libcudart.so.4 => /usr/local/cuda4.2/cuda/lib64/libcudart.so.4 (0x00002b1b9f4e9000)
libnuma.so => /usr/local/pgi/linux86-64/12.6/lib/libnuma.so (0x00002b1b9f745000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b1b9f86c000)
librt.so.1 => /lib64/librt.so.1 (0x00002b1b9fa89000)
libm.so.6 => /lib64/libm.so.6 (0x00002b1b9fc92000)
libc.so.6 => /lib64/libc.so.6 (0x00002b1b9fee9000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002b1ba0279000)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00002b1ba047e000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002b1ba0788000)
/lib64/ld-linux-x86-64.so.2 (0x00002b1b9f2c6000)

However, when running stream_cudafor I can still see that the virtual memory it uses is 76.4 GB, and even the resident size is 325 MB:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6658 avolkov 20 0 76.4g 325m 18m R 101 0.5 0:01.82 stream_cudafor
I guess the only reason I can actually run it is that my workstation has 64 GB of RAM (or 64 GB of swap space).

I now think there is something wrong with my OpenSUSE system configuration, but I have no idea where to even start.

I think I will get a trial version of the PGI compiler and test it on another system.
AnatoliyVolkov32648



Joined: 25 May 2012
Posts: 8

Posted: Thu Jul 26, 2012 11:30 am

I have just tried to run the stream_cudafor executable built as shown above on another machine that has both the CUDA 4.2 and PGI installations NFS-mounted from the first one. This second workstation also runs OpenSUSE 12.1 x86-64, but has different hardware (different mainboard, CPU, and memory), and instead of a GTX 460 it has a Quadro 5000:

avolkov@viz01:/data1/avolkov/src/cuda> /usr/local/cuda4.2/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/deviceQuery
[deviceQuery] starting...

/usr/local/cuda4.2/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Found 1 CUDA Capable device(s)

Device 0: "Quadro 5000"
CUDA Driver Version / Runtime Version 5.0 / 4.2
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 2560 MBytes (2683895808 bytes)
(11) Multiprocessors x ( 32) CUDA Cores/MP: 352 CUDA Cores
GPU Clock rate: 1026 MHz (1.03 GHz)
Memory Clock rate: 1500 Mhz
Memory Bus Width: 320-bit
L2 Cache Size: 655360 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.0, CUDA Runtime Version = 4.2, NumDevs = 1, Device = Quadro 5000
[deviceQuery] test results...
PASSED

> exiting in 3 seconds: 3...2...1...done!


On the second machine (viz01), I can see the stream_cudafor executable and can verify that all the required libraries are available:

avolkov@viz01:/data1/avolkov/src/cuda> ldd ./stream_cudafor
linux-vdso.so.1 => (0x00007fff1b7ff000)
libcudart.so.4 => /usr/local/cuda4.2/cuda/lib64/libcudart.so.4 (0x00002b2b759f6000)
libnuma.so => /usr/local/pgi/linux86-64/12.6/lib/libnuma.so (0x00002b2b75c52000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b2b75d73000)
librt.so.1 => /lib64/librt.so.1 (0x00002b2b75f90000)
libm.so.6 => /lib64/libm.so.6 (0x00002b2b76199000)
libc.so.6 => /lib64/libc.so.6 (0x00002b2b763f0000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002b2b7677f000)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00002b2b76984000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002b2b76c8e000)
/lib64/ld-linux-x86-64.so.2 (0x00002b2b757d3000)

However, when running it, I get the following:
avolkov@viz01:/data1/avolkov/src/cuda> ./stream_cudafor
Error: illegal instruction, illegal opcode
rax 0000000000000001, rbx 000000000000001e, rcx 0000000000000000
rdx 00002ae6efd94450, rsp 00007fff06b31c20, rbp 00007fff06b31c50
rsi 0000000000000009, rdi 00000000ffffffff, r8 000000000000ffff
r9 0000000000000001, r10 00002ae6efa87a20, r11 000000000000000b
r12 0000000000000001, r13 00007fff06b31e20, r14 0000000000000000
r15 0000000000000000
--- traceback not available
Abort

The way I see it, both the GTX 460 and the Quadro 5000 should support compute capability 2.0 (the cc20 target my executable was compiled for), but for some reason it does not work.

Does this information help in any way?
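
One angle on the crash above: the register dump lists x86-64 registers (rax, rbx, ...), so the illegal opcode is raised by host code rather than on the GPU. A hypothesis worth checking, which nothing on this page confirms, is that the binary was built for the host CPU of the first machine (pgf90 reports `-tp bulldozer` there) and that viz01's CPU lacks some of those instructions. The Python sketch below compares a CPU's feature flags against a required set. The function names are hypothetical, `ASSUMED_REQUIRED` is an assumed list of Bulldozer-era extensions rather than anything PGI documents, and `SAMPLE_CPUINFO` is made-up text in the format of the `flags` line in Linux /proc/cpuinfo.

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags from text in /proc/cpuinfo format."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def missing_features(cpuinfo_text, required):
    """List the required features that the CPU does not advertise."""
    return sorted(set(required) - cpu_flags(cpuinfo_text))


# Assumption: extensions a Bulldozer-targeted build might rely on.
# This list is illustrative, not taken from PGI documentation.
ASSUMED_REQUIRED = ["avx", "fma4", "xop"]

# Made-up sample; on a real Linux system, read open("/proc/cpuinfo").read().
SAMPLE_CPUINFO = (
    "model name\t: Example CPU\n"
    "flags\t\t: fpu sse sse2 ssse3 sse4_1 sse4_2 avx\n"
)

missing = missing_features(SAMPLE_CPUINFO, ASSUMED_REQUIRED)
if missing:
    print("CPU does not advertise:", ", ".join(missing))
else:
    print("All assumed features are present")
```

If the second machine's flags turn out to lack such features, rebuilding with a more generic processor target would be the thing to try; if they are all present, this hypothesis can be ruled out.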
All times are GMT - 7 Hours
Page 1 of 3