PGI User Forum


PGI 13.1 breaks on acc regions inside parallel regions

 
njustn



Joined: 09 Nov 2011
Posts: 22

Posted: Mon Feb 04, 2013 11:42 am    Post subject: PGI 13.1 breaks on acc regions inside parallel regions

Hi, this is a new version of an issue that has been around for a while. Previously, applications would fail at runtime unless every acc region inside an omp parallel region was enclosed in an acc data region. Now, with 13.1, I can't get any acc region to work inside an omp parallel region at all. For example, take the following code (apologies for the length, but this is the simplest way to show it).

Code:

#include <stdio.h>
#include <stdlib.h>   /* exit() */
#include <unistd.h>
#include <omp.h>
#include <openacc.h>
/* #include <cuda.h> */
#include <sys/time.h>
#include <sys/types.h>

#define SIZE 1000

void works(){
    int data[15][SIZE]={0};
    for(int j=0; j<4; j++)
    {
        int * stuff = data[j];
        if(j < acc_get_num_devices(acc_device_nvidia)){
            acc_set_device_num(j, acc_device_nvidia);
            fprintf(stderr,"prelaunch: in thread %d, testing device %d\n", j, acc_get_device_num(acc_device_nvidia));
            int i;
#pragma acc data  copyout(stuff[0:SIZE])
            {
                fprintf(stderr,"data environment initialized: in thread %d, testing device %d\n", j, acc_get_device_num(acc_device_nvidia));
#pragma acc kernels
                for(i = 0; i<SIZE; i++)
                {
                    stuff[i] = 1;
                }
                fprintf(stderr,"ACC region complete: in thread %d, testing device %d\n", j, acc_get_device_num(acc_device_nvidia));
            }
            printf("device %d done, checking results\n", j);
            for(i = 0; i<SIZE; i++)
            {
                if(stuff[i] != 1){
                    printf("fail after: %d\n", i);
                    exit(1);
                }
            }
        }
    }
}

void dies(){
    int  data[15][SIZE]={0};
#pragma omp parallel
    {
        int j = omp_get_thread_num();
        int * stuff = data[j];
        if(j < acc_get_num_devices(acc_device_nvidia)){
            acc_set_device_num(j, acc_device_nvidia);
            fprintf(stderr,"prelaunch: in thread %d, testing device %d\n", j, acc_get_device_num(acc_device_nvidia));
            int i;
#pragma acc data  copyout(stuff[0:SIZE])
            {
                fprintf(stderr,"data environment initialized: in thread %d, testing device %d\n", j, acc_get_device_num(acc_device_nvidia));
#pragma acc kernels
                for(i = 0; i<SIZE; i++)
                {
                    stuff[i] = 1;
                }
                fprintf(stderr,"ACC region complete: in thread %d, testing device %d\n", j, acc_get_device_num(acc_device_nvidia));
            }
            printf("device %d done, checking results\n", j);
            for(i = 0; i<SIZE; i++)
            {
                if(stuff[i] != 1){
                    printf("fail after: %d\n", i);
                    exit(1);
                }
            }
        }
    }
}

int main(int argc, char * argv[])
{
    int tid;
    if(argc > 1){
        dies();
    }else{
        works();
    }
    return 0;
}


If this is compiled with the following command, both branches work (testable with ./test and ./test 1).

Code:

/opt/pgi/linux86-64/2012/bin/pgcc -Minfo=accel,mp -mp=allcores -O3 -g -ta=nvidia:cuda4.1,keepgpu,keepptx -acc


Output looks like this.

Works branch:
Code:

prelaunch: in thread 0, testing device 0
data environment initialized: in thread 0, testing device 0
ACC region complete: in thread 0, testing device 0
device 0 done, checking results
prelaunch: in thread 1, testing device 1
data environment initialized: in thread 1, testing device 1
ACC region complete: in thread 1, testing device 1
device 1 done, checking results
prelaunch: in thread 2, testing device 2
data environment initialized: in thread 2, testing device 2
ACC region complete: in thread 2, testing device 2
device 2 done, checking results
prelaunch: in thread 3, testing device 3
data environment initialized: in thread 3, testing device 3
ACC region complete: in thread 3, testing device 3
device 3 done, checking results


Dies branch:
Code:

prelaunch: in thread 3, testing device 3
prelaunch: in thread 0, testing device 0
prelaunch: in thread 2, testing device 2
prelaunch: in thread 1, testing device 1
data environment initialized: in thread 0, testing device 0
ACC region complete: in thread 0, testing device 0
device 0 done, checking results
data environment initialized: in thread 1, testing device 1
data environment initialized: in thread 3, testing device 3
ACC region complete: in thread 1, testing device 1
ACC region complete: in thread 3, testing device 3
data environment initialized: in thread 2, testing device 2
ACC region complete: in thread 2, testing device 2
device 1 done, checking results
device 3 done, checking results
device 2 done, checking results


Compiled with 13.1 using the following line, on the other hand, it fails in the "dies" branch.

Code:

pgcc -Minfo=accel,mp -mp=allcores -O3 -g -ta=nvidia:cuda5.0,keepgpu,keepptx -acc


Works branch:
Code:

prelaunch: in thread 0, testing device 0
data environment initialized: in thread 0, testing device 0
ACC region complete: in thread 0, testing device 0
device 0 done, checking results
prelaunch: in thread 1, testing device 1
data environment initialized: in thread 1, testing device 1
ACC region complete: in thread 1, testing device 1
device 1 done, checking results
prelaunch: in thread 2, testing device 2
data environment initialized: in thread 2, testing device 2
ACC region complete: in thread 2, testing device 2
device 2 done, checking results
prelaunch: in thread 3, testing device 3
data environment initialized: in thread 3, testing device 3
ACC region complete: in thread 3, testing device 3
device 3 done, checking results


Dies branch:
Code:

prelaunch: in thread 0, testing device 0
prelaunch: in thread 3, testing device 3
prelaunch: in thread 2, testing device 2
prelaunch: in thread 1, testing device 1
data environment initialized: in thread 1, testing device 1
data environment initialized: in thread 0, testing device 0
data environment initialized: in thread 2, testing device 2
data environment initialized: in thread 3, testing device 3
call to cuMemcpyDtoHAsync returned error 1: Invalid value
call to cuStreamSynchronize returned error 4: Deinitialized


Has anyone else run into this? I've been trying to fix it for a few days now with no success and have run out of ideas. Since the compile lines specify CUDA versions, I also tested with different CUDA versions, with the same result.
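For narrowing down a failure like this, a smaller per-thread check can help. The following is only a sketch and not code from this thread: `ROWS`, `COLS`, `fill_and_check` and `run_threads` are hypothetical names. It keeps the same `acc data copyout` plus `acc kernels` pattern as the reproducer, but counts failing threads instead of calling `exit()` inside the parallel region, and it guards the OpenMP and OpenACC headers with the standard `_OPENMP` and `_OPENACC` macros so the same file also builds as plain serial C for comparison.

```c
#include <stdio.h>
#ifdef _OPENACC
#include <openacc.h>
#endif
#ifdef _OPENMP
#include <omp.h>
#endif

#define ROWS 4      /* hypothetical: one row per thread/device under test */
#define COLS 1000   /* same element count as SIZE in the reproducer */

/* Fill buf inside an acc data region (on the host if OpenACC is disabled)
 * and return the index of the first wrong element, or -1 if all are 1. */
int fill_and_check(int *buf, int n)
{
    int i;
#pragma acc data copyout(buf[0:n])
    {
#pragma acc kernels
        for (i = 0; i < n; i++) {
            buf[i] = 1;
        }
    }
    for (i = 0; i < n; i++) {
        if (buf[i] != 1) {
            return i;
        }
    }
    return -1;
}

/* Run fill_and_check once per thread; return how many threads saw bad data. */
int run_threads(void)
{
    int data[ROWS][COLS] = {{0}};
    int failures = 0;
#pragma omp parallel num_threads(ROWS) reduction(+:failures)
    {
        int t = 0;
        int use_thread = 1;
#ifdef _OPENMP
        t = omp_get_thread_num();
#endif
#ifdef _OPENACC
        /* As in the reproducer: one device per thread, extra threads sit out. */
        if (t < acc_get_num_devices(acc_device_nvidia)) {
            acc_set_device_num(t, acc_device_nvidia);
        } else {
            use_thread = 0;
        }
#endif
        if (use_thread && t < ROWS) {
            int bad = fill_and_check(data[t], COLS);
            if (bad >= 0) {
                fprintf(stderr, "thread %d: first bad element at %d\n", t, bad);
                failures += 1;
            }
        }
    }
    return failures;
}
```

Calling `run_threads()` from `main` and returning its result as the exit status gives a pass/fail signal per build. A runtime abort such as the `cuMemcpyDtoHAsync` error above would still terminate the process before the count is returned, so this only separates wrong results from crashes.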
mkcolg



Joined: 30 Jun 2004
Posts: 6125
Location: The Portland Group Inc.

Posted: Mon Feb 04, 2013 2:17 pm

Hi njustn,

Thanks for the report. This does look like a problem with 13.1. The multi-device support was completely updated in order to support multiple device types (i.e. NVIDIA, AMD, Intel, etc.). Unfortunately, it appears that there are a few issues still to be worked out.

I submitted a problem report (TPR#19102) to our engineers. The good news is that your code works with our internal development compiler, meaning that our engineers have already found and fixed this issue. I'm not sure of its status, but hopefully this means we can have the fix in the 13.2 compilers, due out shortly.

- Mat
jtull



Joined: 30 Jun 2004
Posts: 438

Posted: Fri Nov 15, 2013 12:02 pm    Post subject: TPR 19102 is fixed

19102 - ACC: User code gets runtime error with mixed OpenMP and OpenACC program. Worked in 12.10.

This has been fixed since the 13.3 release.
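For code that has to build across several compiler releases, the version range discussed in this thread can be checked at build time. This is a sketch under assumptions: `affected_by_tpr19102` and `this_compiler_affected` are hypothetical helper names, the `__PGIC__` and `__PGIC_MINOR__` version macros should be confirmed against your compiler's documentation, and treating 13.2 as affected is a conservative guess, since the thread only confirms that 13.1 fails and that 12.10 and 13.3 onward work.

```c
/* Returns 1 if a PGI major.minor release is in the range this thread
 * reports as broken or unconfirmed for TPR 19102, and 0 otherwise.
 * 13.1: reported broken. 13.2: unconfirmed, treated as affected (assumption).
 * 12.x and 13.3 onward: reported working. */
int affected_by_tpr19102(int major, int minor)
{
    return major == 13 && (minor == 1 || minor == 2);
}

/* Returns 1 if the compiler building this file is in that range.
 * Compilers that do not define the PGI version macros report 0. */
int this_compiler_affected(void)
{
#if defined(__PGIC__) && defined(__PGIC_MINOR__)
    return affected_by_tpr19102(__PGIC__, __PGIC_MINOR__);
#else
    return 0;
#endif
}
```

A program mixing OpenMP and OpenACC could call `this_compiler_affected()` at startup and print a warning, or fall back to a single-threaded device loop like the `works()` function in the first post.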

thanks,
dave
All times are GMT - 7 Hours