mkcolg
Joined: 30 Jun 2004  Posts: 4996  Location: The Portland Group Inc.
Posted: Mon Jan 28, 2013 12:54 pm
Hi Sotiris,
To achieve 2-D parallelism here, you either need to make "A" 2-D or add a reduction.
| Code: | !$acc kernels loop independent
do i = 1, N
   sum = 0.0
!$acc loop reduction(+:sum)
   do j = 1, M
      sum = sum + B(i,j)
   enddo
   A(i) = A(i) + sum
enddo |
The caveat is that performing a reduction has overhead, and an inner-loop reduction limits the schedules that can be used (the inner loop can only be a "vector" and the outer loop must be a "gang"). So unless your inner loop has enough computation to offset this overhead, you may be better off accelerating only the outer loop. You'll need to experiment to see which method works best for your particular code.
- Mat |
paokara
Joined: 06 Feb 2011  Posts: 19
Posted: Mon Mar 18, 2013 12:26 pm
Hello Mat,
We have a problem with our code and we can't figure out the reason. We are working on an N-body problem and have rewritten some of our Fortran routines with OpenACC directives. We have only one "heavy" part, the part where we calculate the accelerations (a 2-D loop), done in the way we discussed in this topic. We are using a Tesla C1060.
Our implementation works fine with a small number of bodies, but when we use a large number for our system (for example 10,000 x 10,000 for the 2-D acceleration part) we get NaN in our results. We believe this is a memory issue. Is that possible?
Another important issue is that we are using double precision variables in our code. The Tesla C1060 has only 1/10 as many cores for double precision calculations. Is it possible that the device can't execute all the double precision calculations and gives us NaN?
Thanks,
Sotiris |
mkcolg
Joined: 30 Jun 2004  Posts: 4996  Location: The Portland Group Inc.
Posted: Mon Mar 18, 2013 12:54 pm
| Quote: | | We believe that this is a memory issue. Is that possible? | If you were running out of memory, the binary would abort execution. The C1060s don't have ECC memory, so the problem could still be memory related, but I doubt it.
Try adding "-Mlarge_arrays" to your compilation; it's possible that the index calculations need to be adjusted. Granted, 10k x 10k isn't that large, so this may not be the issue.
What do the -Minfo messages say about how the loop is being scheduled?
| Quote: | | Another important issue is that we are using double precision variables for our code. Tesla C1060 has only 1/10 of the cores for double precision calculations. Is it possible that the device can't execute all the double precision calculations and give us NaN? | Doubtful. That would just slow you down, but shouldn't give you NaNs.
The other thing I'd look for is overflow. You might have an integer*4 variable that needs to be integer*8, or a real*4 that needs to be real*8. Try adding the flags "-i8 -r8" to change the default kinds to the larger data types and see if that helps.
If not, then I'll need to see a reproducing example to determine the issue.
- Mat |
paokara
Joined: 06 Feb 2011  Posts: 19
Posted: Tue Mar 19, 2013 3:23 pm
Hi Mat, and thanks for the quick reply,
Today we finally figured out what the problem was.
To avoid unnecessary calculations we use two IF statements inside our OpenACC regions. We have one DATA region, and every N steps we copy our data back to the host. In the first of those N steps we must do the calculations inside the two IF statements mentioned above, but in the remaining N-1 steps we don't want those calculations. The first IF statement is inside a PARALLEL region and the second is inside a KERNELS region. Here is the structure of our code:
| Code: |
!$acc data copy(ARRAYS)
do while (for N steps)
!$acc parallel
   if (flag1) then
      CALCULATIONS that we want only in the first step
   endif
   MORE CALCULATIONS
!$acc end parallel
!$acc kernels
   if (flag2) then
      CALCULATIONS that we want only in the first step
   endif
!$acc end kernels
!$acc parallel
   MORE CALCULATIONS
!$acc end parallel
enddo
!$acc end data
|
If we use those IF statements with large arrays we get NaN results. (If we don't use them we get the correct results, but 2X slower execution.)
Can you guess why this is happening? Are flag1 and flag2 the problem? That is, is it possible that not all the threads of the grid see the correct values of these scalar variables?
Thanks,
Sotiris |
mkcolg
Joined: 30 Jun 2004  Posts: 4996  Location: The Portland Group Inc.
Posted: Wed Mar 20, 2013 9:39 am
Hi Sotiris,
By using the "parallel" construct, you are specifying that everything in the region will be moved over to the device. It is also work-shared, so unless you use the "loop" directive to tell the compiler how you want the work divided, every thread in the region executes the same code. With the "kernels" construct, the compiler figures out the best way to divide up the work. (FYI, this article may be useful in understanding the differences between the two: http://www.pgroup.com/lit/articles/insider/v4n2a1.htm)
Without seeing the code, I can't be sure this is the source of the NaNs, but it is possible. Do you get wrong answers if you switch to using just "kernels"?
- Mat |
Powered by phpBB © 2001, 2002 phpBB Group