PGI User Forum

using cuda libraries with OpenACC

 
conqueror

Joined: 30 Jan 2012
Posts: 1

Posted: Thu Jul 12, 2012 12:34 pm    Post subject: using cuda libraries with OpenACC

Hello,
Here is an attempt to calculate pi using cuRAND and OpenACC.
The example works fine with PGI 12.3 and CUDA 4.0.
The same program compiled with PGI 12.5 and CUDA 4.0 always gives 0 as the result.
Compiling the example against CUDA 4.1 gives the following error:

Code:
$ make
pgcc -acc -ta=nvidia -Minfo=all main.c pi.c -I /opt/pgi/linux86-64/2012/cuda/4.1/include -L /opt/pgi/linux86-64/2012/cuda/4.1/lib64 -lcudart -lcurand -o pi_acc
main.c:
PGC-F-0249-#error --  --- !!! UNKNOWN COMPILER: please provide a CUDA compatible definition for '__align__' !!! --- (/opt/pgi/linux86-64/2012/cuda/4.1/include/host_defines.h: 128)
PGC/x86-64 Linux 12.3-0: compilation aborted
pi.c:
pi:
     10, Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     11, Loop is parallelizable
         Accelerator kernel generated
         11, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
             CC 1.3 : 10 registers; 312 shared, 28 constant, 0 local memory bytes; 25% occupancy
             CC 2.0 : 12 registers; 264 shared, 72 constant, 0 local memory bytes; 16% occupancy
         14, Sum reduction generated for s
make: *** [pi_acc] Error 2


Code:
cat pi.c
#include <math.h>
// Iterate over pairs of (x,y) from [0,1] x [0,1]
// Increment the counter only if (x,y) is inside a circle with R = 0.5
long pi(float *dx, float *dy, long n)
{
    long i, s = 0;
    #pragma acc data deviceptr(dx, dy)
    #pragma acc kernels
    for (i = 0; i < n; i++) {
        float x = dx[i] - 0.5;
        float y = dy[i] - 0.5;
        s += (x*x + y*y < 0.25);
    }
    return s;
}
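
For reference, the same kernel can also be written with the reduction spelled out explicitly (pi_explicit is just an illustrative name; I have not checked whether this behaves any differently):

Code:
// Variant of pi() with an explicit reduction clause -- untested sketch.
long pi_explicit(float *dx, float *dy, long n)
{
    long i, s = 0;
    #pragma acc parallel loop deviceptr(dx, dy) reduction(+:s)
    for (i = 0; i < n; i++) {
        float x = dx[i] - 0.5;
        float y = dy[i] - 0.5;
        s += (x*x + y*y < 0.25);
    }
    return s;
}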

Code:
cat main.c
#include <stdio.h>
#include <stdlib.h>

#include <cuda_runtime.h>
#include <curand.h>

#include "common.h"

int main(int argc, char **argv) {

    float *dx, *dy; // device arrays
    long s;         // counter of successful tries
    long n;         // number of tries

    if (argc < 2) {
        fprintf(stderr, "Usage: %s N\n", argv[0]);
        return EXIT_FAILURE;
    }

    if ((n = atol(argv[1])) <= 0) {
        fprintf(stderr, "N must be positive\n");
        return EXIT_FAILURE;
    }

    // Allocate memory for device arrays to be filled with random values
    CUDA_CALL(cudaMalloc((void **)&dx, n * sizeof(float)));
    CUDA_CALL(cudaMalloc((void **)&dy, n * sizeof(float)));

    // Initialize CURAND generator, seed it and generate data
    curandGenerator_t g;
    CURAND_CALL(curandCreateGenerator(&g, CURAND_RNG_PSEUDO_DEFAULT));
    CURAND_CALL(curandSetPseudoRandomGeneratorSeed(g, time(NULL)));
    CURAND_CALL(curandGenerateUniform(g, dx, n));
    CURAND_CALL(curandGenerateUniform(g, dy, n));

    // Perform computations
    s = pi(dx, dy, n);

    // Destroy the generator and free device memory
    CURAND_CALL(curandDestroyGenerator(g));
    CUDA_CALL(cudaFree(dx));
    CUDA_CALL(cudaFree(dy));

    // Print out the result
    printf("%lf\n", (double) s / n * 4.0);

    return EXIT_SUCCESS;
}

Code:
cat common.h
#ifndef COMMON_H
#define COMMON_H

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#include <cuda_runtime.h>
#include <curand.h>

long pi(float *, float *, long);

// Difference between two timespecs, in nanoseconds
static inline long clock_difftime(struct timespec t1, struct timespec t0) {
    return (t1.tv_sec - t0.tv_sec) * 1000000000L + (t1.tv_nsec - t0.tv_nsec);
}

// Convenience macros for CUDA and CURAND calls

#define CUDA_CALL(x) do { cudaError_t rc = ( x ); if (rc != cudaSuccess) { \
        fprintf(stderr, "CUDA error %d (%s) at %s:%d\n", rc, cudaGetErrorString(rc), __FILE__, __LINE__); \
        exit(EXIT_FAILURE); \
}} while (0)

#define CURAND_CALL(x) do { curandStatus_t rc = ( x ); if (rc != CURAND_STATUS_SUCCESS) { \
        fprintf(stderr, "CURAND error %d (%s) at %s:%d\n", rc, curandGetErrorString(rc), __FILE__, __LINE__); \
        exit(EXIT_FAILURE); \
}} while (0)

// Convert CURAND error code to string

static const char *curandGetErrorString(curandStatus_t rc) {

    switch(rc) {
    case CURAND_STATUS_SUCCESS:
        return "No errors";
    case CURAND_STATUS_VERSION_MISMATCH:
        return "Header file and linked library version do not match";
    case CURAND_STATUS_NOT_INITIALIZED:
        return "Generator not initialized";
    case CURAND_STATUS_ALLOCATION_FAILED:
        return "Memory allocation failed";
    case CURAND_STATUS_TYPE_ERROR:
        return "Generator is wrong type";
    case CURAND_STATUS_OUT_OF_RANGE:
        return "Argument out of range";
    case CURAND_STATUS_LENGTH_NOT_MULTIPLE:
        return "Length requested is not a multiple of dimension";
// In CUDA >= 4.1 only
#if CUDART_VERSION >= 4010
    case CURAND_STATUS_DOUBLE_PRECISION_REQUIRED:
        return "GPU does not have double precision required by MRG32k3a";
#endif
    case CURAND_STATUS_LAUNCH_FAILURE:
        return "Kernel launch failure";
    case CURAND_STATUS_PREEXISTING_FAILURE:
        return "Preexisting failure on library entry";
    case CURAND_STATUS_INITIALIZATION_FAILED:
        return "Initialization of CUDA failed";
    case CURAND_STATUS_ARCH_MISMATCH:
        return "Architecture mismatch, GPU does not support requested feature";
    case CURAND_STATUS_INTERNAL_ERROR:
        return "Internal library error";
    default:
        return "Unknown error";
    }
}

#endif // COMMON_H


Would you please tell me what I have to do to make it work with newer CUDA versions?
mkcolg

Joined: 30 Jun 2004
Posts: 6215
Location: The Portland Group Inc.

Posted: Thu Jul 12, 2012 5:17 pm

Hi conqueror,

There are two separate issues here. First, it appears that 12.5 introduced a problem where using a "long" data type as the sum reduction variable no longer works correctly. Changing "s" to an "int" or "float" works around the issue. I have sent a report (TPR#18815) to our engineers for further investigation.

For example:
Code:
long pi2(float *dx, float *dy, long n)
{
  long i;
  int s = 0;
#pragma acc data deviceptr(dx, dy)
#pragma acc kernels
  for (i = 0; i < n; i++) {
    float x = dx[i] - 0.5;
    float y = dy[i] - 0.5;
    s+= ((x*x + y*y) < 0.25);
  }

  return (long) s;
}

% pgcc -acc main.c pi.c -I/opt/pgi/linux86-64/2012/cuda/4.0/include -L/opt/pgi/linux86-64/2012/cuda/4.0/lib64 -lcurand -o pi_acc -V12.5
main.c:
pi.c:
% pi_acc   100000000
Both should print 3.141...
Using Long: 0.000000
Using Int: 3.141509

Quote:
Would you please tell me what I have to do to make it work with newer CUDA versions?
Making CUDA C calls from pgcc-compiled code is not actually supported. It happens to work in this case because we ported NVIDIA's CUDA 4.0 header files for use with our CUDA X86 C++ compiler. In our 12.6 release, CUDA X86 will switch to CUDA 4.1, so you should be able to use those headers as well.
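
In the meantime, if the 4.1 headers keep getting in the way, one general interop pattern (not PGI-specific, and only a sketch; the file and function names below are illustrative) is to isolate all of the CUDA runtime and cuRAND calls in a small .cu file compiled with nvcc, and expose a plain C interface to the pgcc-compiled code:

Code:
// randfill.cu -- compiled with nvcc, so the CUDA headers never pass through pgcc.
// The file name and function names are illustrative only.
#include <cuda_runtime.h>
#include <curand.h>

extern "C" float *device_uniform(long n, unsigned long long seed)
{
    float *d;
    if (cudaMalloc((void **)&d, n * sizeof(float)) != cudaSuccess)
        return 0;

    curandGenerator_t g;
    if (curandCreateGenerator(&g, CURAND_RNG_PSEUDO_DEFAULT) != CURAND_STATUS_SUCCESS)
        return 0;
    curandSetPseudoRandomGeneratorSeed(g, seed);
    curandGenerateUniform(g, d, n);
    curandDestroyGenerator(g);

    return d;  // device pointer, ready for an OpenACC deviceptr() clause
}

extern "C" void device_release(float *d)
{
    cudaFree(d);
}

The OpenACC side then only needs the prototypes (float *device_uniform(long, unsigned long long); and void device_release(float *);) and never includes curand.h, so the host_defines.h compiler check is never triggered in the pgcc-compiled files.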

Thanks!
Mat