Tiled loop increases the MSE of an NLM denoising filter result compared to the result obtained with the collapse clause

OpenACC and CUDA Fortran
manu3193
Posts: 8
Joined: Oct 18 2018

Post by manu3193 » Mon May 11, 2020 9:48 am

Hello again Mat,


I replaced the NPP calls with OpenCV functions so that I can focus on the OpenACC parallel region. The program compiles for the multicore target, at least for now; it may be better to start from here and then work toward an appropriate speedup on the target K40c GPU. I created a new git branch: https://github.com/manu3193/DNLM-P/tree ... penacc/src.

Denoising a 256x256-pixel image takes 20 s on a quad-core Xeon with AVX2 at 3.3 GHz. The command I used is:

Code:

./nlmfilter_multicore_fft -w 7 -n 3 -s 0.5 256x256.png


The PGI compiler reports the following information:

Code:

DNLM_OpenACC(const float *, int, const float *, int, float *, int, int, int, int, int, int, int, int, int, float):
     34, Generating Multicore code
         36, #pragma acc loop gang
     66, Accelerator restriction: size of the GPU copy of pWindowIJCorr,pEuclDist is unknown
         Loop is parallelizable
     68, Loop is parallelizable
     84, Accelerator restriction: size of the GPU copy of pEuclDist is unknown
         Loop is parallelizable
     86, Loop is parallelizable
     97, Accelerator restriction: size of the GPU copy of pEuclDist is unknown
         Loop is parallelizable
     99, Loop is parallelizable
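
As far as I understand, the "size of the GPU copy ... is unknown" messages mean that the compiler cannot work out how many elements the raw pointers pEuclDist and pWindowIJCorr refer to. A minimal sketch of how array shapes can be stated in OpenACC data clauses is below. The function and parameter names (squaredDifference, a, b, out, n) are hypothetical and are not taken from the DNLM-P sources; this only illustrates the clause syntax, not the actual filter.

```cpp
// Hypothetical helper, NOT the DNLM_OpenACC code.
// The shapes a[0:n], b[0:n] and out[0:n] tell the compiler how many
// elements each raw pointer covers (start index 0, length n), which is
// the information the "size ... is unknown" message reports as missing.
void squaredDifference(const float *a, const float *b, float *out, int n)
{
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(out[0:n])
    for (int i = 0; i < n; ++i) {
        float d = a[i] - b[i];
        out[i] = d * d;  // element-wise squared difference
    }
}
```

When the file is built without OpenACC support the pragma is ignored and the loop runs serially, so the result can be checked on the host first.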

Post by manu3193 » Thu May 21, 2020 8:09 pm

Hello Mat,

Just a quick follow-up to see whether I can get this filter properly accelerated.

Thanks for your help,
