PGI User Forum


13.8 Unexpected load/store type when using cache
 
luxuia



Joined: 01 Apr 2013
Posts: 8

Posted: Sun Sep 15, 2013 7:47 pm    Post subject: 13.8 Unexpected load/store type when using cache

Hi,
I'm using pgcc 13.8-0, 64-bit target on x86-64.
I tested the matrix-multiply cache sample and it works well.
But in my program, the compiler reports:
Code:
"PGC-W-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unexpected load/store type (Test.c: 37)"


Code:

void compute(   double * profile, double * matrix, double * mean,
                double * result, const int n_sample, const int gene_num,
                const int cluster_num) {
    const int column = n_sample * n_sample;

#pragma acc data copyin( profile[0:gene_num*n_sample], \
                         matrix[0:cluster_num*column], \
                         mean[0:cluster_num*n_sample] ), \
                 copyout( result[0:gene_num*cluster_num] )

    #pragma acc kernels loop independent
    for(int g = 0; g < gene_num; ++g) {

        #pragma acc loop independent
        for(int c = 0; c < cluster_num; ++c) {
            double tmp = 0;

            #pragma acc cache(profile[g*n_sample:n_sample])
            #pragma acc loop reduction(+:tmp)
            for(int i = 0; i < n_sample; ++i) {
                double t = 0.0;
                for(int j = 0; j < n_sample; ++j) {
                    t += ( profile[g*n_sample+j] - mean[c*n_sample+j] ) *
                         matrix[c*column + j*n_sample + i];
                }
                tmp += t * (profile[g*n_sample+i] - mean[c*n_sample+i]);
            }
            result[g * cluster_num + c] = tmp;
        }
    }
}


Line 37 is the first loop's directive:
Code:
 #pragma acc kernels loop independent


And the -Minfo output:
Code:
PGC-W-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unexpected load/store type (Test.c: 37)
compute:
     35, Generating copyin(profile[0:n_sample*gene_num])
         Generating copyin(matrix[0:column*cluster_num])
         Generating copyin(mean[0:n_sample*cluster_num])
         Generating copyout(result[0:gene_num*cluster_num])
     38, Loop is parallelizable
     40, Loop is parallelizable
         Accelerator kernel generated
         38, #pragma acc loop gang /* blockIdx.y */
         40, #pragma acc loop gang /* blockIdx.x */
         49, #pragma acc loop vector(128) /* threadIdx.x */
         Loop is parallelizable
     51, Loop is parallelizable
PGC/x86-64 Linux 13.8-0: compilation completed with warnings


If I replace
Code:
#pragma acc cache(profile[g*n_sample:n_sample])

with
Code:
#pragma acc cache(profile[0:10])


the -Minfo output is:
Code:

PGC-W-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unexpected load/store type (Test.c: 37)
compute:
     35, Generating copyin(profile[0:n_sample*gene_num])
         Generating copyin(matrix[0:column*cluster_num])
         Generating copyin(mean[0:n_sample*cluster_num])
         Generating copyout(result[0:gene_num*cluster_num])
     38, Loop is parallelizable
     40, Loop is parallelizable
         Accelerator kernel generated
         38, #pragma acc loop gang /* blockIdx.y */
             Cached references to size [11] block of 'profile'
         40, #pragma acc loop gang /* blockIdx.x */
         49, #pragma acc loop vector(128) /* threadIdx.x */
         Loop is parallelizable
     51, Loop is parallelizable
PGC/x86-64 Linux 13.8-0: compilation completed with warnings

What's wrong with my code? The cache problem seems to have been fixed in version 13.5.
Thank you in advance.
luxuia
mkcolg



Joined: 30 Jun 2004
Posts: 6134
Location: The Portland Group Inc.

Posted: Mon Sep 16, 2013 1:32 pm

Hi luxuia,

This is a known issue (TPR#19395) that we have an engineer actively working on. Unfortunately, it doesn't look like he'll have it fixed in time for 13.9, but hopefully by 13.10. In the meantime, please comment out the "cache" directive.

Best Regards,
Mat
luxuia



Joined: 01 Apr 2013
Posts: 8

Posted: Mon Sep 16, 2013 7:05 pm

Hi mkcolg,

Oh, my heart is breaking...
On-chip memory makes a huge contribution in GPU computing.
So, does the 'cache' issue arise only in the 13.x releases, or in all versions?
Maybe I can try an older version for better performance :).
Meanwhile, the TPRs listed at http://www.pgroup.com/support/release_tprs_2013.htm are problems that have been fixed, not problems that have been found, right?
Thanks for your help.

Best Regards,
Luxuia

mkcolg wrote:
Hi luxuia,

This is a known issue (TPR#19395) that we have an engineer actively working on. Unfortunately, it doesn't look like he'll have it fixed in time for 13.9, but hopefully by 13.10. In the meantime, please comment out the "cache" directive.

Best Regards,
Mat
mkcolg



Joined: 30 Jun 2004
Posts: 6134
Location: The Portland Group Inc.

Posted: Tue Sep 17, 2013 9:14 am

Hi Luxuia,

Quote:
On-chip memory makes a huge contribution in GPU computing.
Actually, software-managed shared memory is only critical on the C1060 (cc1.3). For later devices, NVIDIA added hardware caching, which diminishes the need for the program to manage this memory itself. It can still help, but not as much as before.

Quote:
So, does the 'cache' issue arise only in the 13.x releases, or in all versions?
While this particular error is fairly new (around 13.4, I think), the cache directive has been problematic in general. We're working on it.

Quote:
Maybe I can try an older version for better performance :).
Possibly, but for a different reason than the cache directive. In 13.1 we started to pin memory for better data-transfer performance. Unfortunately, this actually causes a slowdown when there are many frees, since the CUDA routine that frees pinned memory needs to synchronize with the device. In 13.9, we made a new method the default and added a "-ta=nvidia,pin" flag to select the pinned-memory method (which does help in some cases).
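As an illustration (the file name is hypothetical; the flag is the one described above), the two 13.9 build variants might look like:

```shell
# 13.9 default: new data-transfer method, no pinned host memory
pgcc -acc -ta=nvidia -Minfo=accel Test.c -o test_default

# Opt back into pinned host memory, which helps in some cases
pgcc -acc -ta=nvidia,pin -Minfo=accel Test.c -o test_pinned
```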

Quote:
Meanwhile, the TPRs listed at http://www.pgroup.com/support/release_tprs_2013.htm are problems that have been fixed, not problems that have been found, right?
No, as I said above, the problem is not fixed in a release yet, but I hope it will be by 13.10.

- Mat
luxuia



Joined: 01 Apr 2013
Posts: 8

Posted: Tue Sep 17, 2013 6:49 pm

Hi mkcolg,

Thank you very much.
I will test 13.1 and report back. :)

Thanks again for your patience.

mkcolg wrote:
Hi Luxuia,

Quote:
On-chip memory makes a huge contribution in GPU computing.
Actually, software-managed shared memory is only critical on the C1060 (cc1.3). For later devices, NVIDIA added hardware caching, which diminishes the need for the program to manage this memory itself. It can still help, but not as much as before.
- Mat