Overview

PGI Compiler Assisted Software Testing (PCAST) is a set of capabilities intended to help test program correctness and determine points of divergence. PCAST is useful for detecting when results diverge between CPU and GPU versions of code, and also between runs of the same code on different processor architectures.

There are three ways to invoke PCAST:

  1. Through the pgi_compare run-time call
  2. Through the acc_compare run-time call
  3. With the autocompare compiler flag

We'll use the following simple example to illustrate using each approach. This OpenACC C program allocates two arrays on the heap, copies the data to the GPU, and creates gangs of workers to execute the inner loop. The next sections demonstrate the different ways to use PCAST to test for program correctness.

#include <stdlib.h>  /* malloc */

int main() {
  int size = 1000;
  int i, t;
  float *a1;
  float *a2;

  a1 = (float*)malloc(sizeof(float)*size);
  a2 = (float*)malloc(sizeof(float)*size);

  for (i = 0; i < size; i++) {
    a1[i] = 1.0f;
    a2[i] = 2.0f;
  }

#pragma acc data copy(a1[0:size], a2[0:size])
  {
    for (t = 0; t < 5; t++) {
      #pragma acc parallel
      for(i = 0; i < size; i++) {
        a2[i] += a1[i];
      }
    }
  }

  return 0;
}

Controlling Comparison Options

PCAST's behavior is set through the PGI_COMPARE environment variable. This environment variable contains a comma-separated list of options that control various parameters of the comparison. Use it, for example, to set relative or absolute tolerance thresholds or to halt at the first difference found. See the PGI_COMPARE table at the end for a full listing of the available options.

Auto-compare

The simplest way to invoke PCAST is through the autocompare compiler flag. Setting -ta=tesla:autocompare in the compiler options is the only change necessary to invoke the autocompare feature. When compiled with this option, code in OpenACC compute regions will run redundantly on the CPU as well as the GPU. Whenever computed data is copied off the GPU and back into host memory, it is compared against the values computed on the CPU. Hence, any data in a copy, copyout, or update host directive will be compared when it is copied off the device. Note that -ta=tesla:autocompare implies -ta=tesla:redundant.

To use autocompare, compile the example using these compiler options:

$ pgcc -Minfo=accel -ta=tesla:autocompare -o a.out example.c

Next, running the compiled executable with the options below produces the following output:

$ PGI_COMPARE=summary,rel=1 ./a.out

comparing a1 in example.c, function main line 26

comparing a2 in example.c, function main line 26
compared 2 blocks, 2000 elements, 8000 bytes
no errors found
 relative tolerance = 0.100000, rel=1

The 'summary' option to PGI_COMPARE will print out a short summary (the last three lines) of how much data was compared, whether any differences were found, and what the tolerances were. Here, we can see that autocompare checked two blocks of data, which contained a total of 2000 elements, and those elements took up 8000 bytes of memory. It didn't find any errors, and the relative tolerance, controlled by the 'rel' option to PGI_COMPARE, was set to 10^(-1). In general, setting 'rel=n' in PGI_COMPARE will set the tolerance to 10^(-n). The one special case is 0, which will not tolerate any errors.

If there are some differences in the data, the output should look similar to this:

$ pgcc -ta=tesla:autocompare -o a.out example.c

$ PGI_COMPARE=summary,compare,abs=1 ./a.out
PCAST a1 comparison-label:0 Float
	idx: 0 FAIL ABS  act: 8.40187728e-01 exp: 1.00000000e+00 tol: 1.00000001e-01
	idx: 1 FAIL ABS  act: 3.94382924e-01 exp: 1.00000000e+00 tol: 1.00000001e-01
	idx: 2 FAIL ABS  act: 7.83099234e-01 exp: 1.00000000e+00 tol: 1.00000001e-01
	idx: 3 FAIL ABS  act: 7.98440039e-01 exp: 1.00000000e+00 tol: 1.00000001e-01

Function Calls

Use the acc_compare function to explicitly compare data. When called, it copies the data in GPU memory back to the host and compares it with the corresponding CPU memory. acc_compare must be called from CPU code, not from a device compute region. To use acc_compare, compile with the -ta=tesla:redundant compiler option.

For reference, acc_compare's signature is:

acc_compare(x, n)

where x is the data to compare and n is the number of elements to compare. (Note that, unlike functions such as memcpy or malloc, n is a count of elements, not bytes.) In the example, size elements are compared. Because n counts elements, the call would remain the same if we changed the array's element type from float to, say, double.

#pragma acc data copy(a1[0:size], a2[0:size])
  {
    for (t = 0; t < 5; t++) {
      #pragma acc parallel
      for(i = 0; i < size; i++) {
        a2[i] += a1[i];
      }
      acc_compare(a2, size);
    }
  }

Compile with the following command-line options, noting the redundant flag:

$ pgcc -Minfo=accel -ta=tesla:redundant -o a.out example.c

Again, running the compiled executable with the options below produces the following output:

$ PGI_COMPARE=summary,rel=1 ./a.out
compared 5 blocks, 5000 elements, 20000 bytes
no errors found
 relative tolerance = 0.100000, rel=1

Note that, in this example, acc_compare is called five times in the outer loop on an array of size 1000, with each element of size four bytes, totalling 20,000 bytes. With autocompare the data was compared just once at the end of the data directive, instead of in each iteration of the outer loop.

While acc_compare keeps the data in memory, pgi_compare writes the data to be compared to a file. Subsequent calls to pgi_compare compare the data in host memory against that file. One advantage of this approach is that successive comparisons run faster, since a "golden" copy is already on disk. The downside is that the data file can grow very large, depending on how much data the program uses and how often comparisons are done. In general, it is a good idea to use pgi_compare sparingly, on programs where the data involved is relatively small, or when it is necessary to compare results on different machines.

Its signature is as follows, where a is the variable to be compared, "type" is a string naming a's data type, n is the number of elements to be compared (again, not in bytes), and the last two arguments specify the function name and line number, respectively.

pgi_compare(a, "type", n, "str", int)

#pragma acc data copy(a1[0:size], a2[0:size])
  {
    for (t = 0; t < 5; t++) {
      #pragma acc parallel
      for(i = 0; i < size; i++) {
        a2[i] += a1[i];
      }
      
      #pragma acc update host(a2[0:size])
      pgi_compare(a2, "float", size, "main", 23);
      
    }
  }

Compiling with the redundant or autocompare option is not required to use pgi_compare. Once again, running the compiled executable with the options below produces the following output:

$ PGI_COMPARE=summary,rel=1 ./a.out
datafile pgi_compare.dat created with 5 blocks, 5000 elements, 20000 bytes
$ PGI_COMPARE=summary,rel=1 ./a.out
datafile pgi_compare.dat compared with 5 blocks, 5000 elements, 20000 bytes
no errors found
 relative tolerance = 0.100000, rel=1

The first time the program runs, the data file "pgi_compare.dat" is created. Subsequent runs compare the calculated data against this file. Use the PGI_COMPARE environment variable to set the name of the file, or to force the program to create a new file on disk with PGI_COMPARE=create. To use pgi_compare on GPU data, include an update host directive with the data to compare, because pgi_compare only writes data that is in host memory.

See also Michael Wolfe's PGInsider blog post on PCAST: PGI Compiler Assisted Software Testing for more information.

Full List of Options for PGI_COMPARE

Option              Description
abs=n               Compare absolute difference; tolerate differences up to 10^(-n) (only applicable to floating point types)
create              Specifies that this is the run that will produce the reference results
compare             Specifies that the current run will be compared with a reference file
datafile="name"     Name of the file that data will be saved to, or compared against. If empty, the default 'pgi_compare.dat' is used
ieee                Run IEEE checks (only implemented for floats and doubles)
outputfile="name"   Save comparison output to a specific file. Default behavior is to output to stderr
patch               Patch errors (outside tolerance) with correct values
patchall            Patch all differences (inside and outside tolerance) with correct values
rel=n               Compare relative difference; tolerate differences up to 10^(-n) (only applicable to floating point types)
report=n            Report up to n fails; default is 50
reportall           Report all passes and fails
reportpass          Report passes; respects limit set with report=n
silent              Suppress output; overrides all other output options, including summary and verbose
stop                Stop at the first difference
summary             Print summary of comparisons at end of run
ulp=n               Compare Unit of Least Precision difference (only implemented for floats and doubles)
verbose             Output more details of the comparison (including patches)
verboseautocompare  Output verbose reporting of what and where the host is comparing (autocompare only)