lechat
Joined: 14 Oct 2012 Posts: 10
Posted: Tue Oct 16, 2012 3:34 pm Post subject: error 702: Launch timeout happens non-deterministically |
(I'm using PGI 12.8 on Linux 64, with a GeForce GTX 280 and CUDA 4.1.)
I'm doing some experiments with OpenACC, and this is puzzling me:
I had the following code to perform matrix multiplications:
Code:
typedef float ff;

void mmul(const ff* restrict a,
          const ff* restrict b,
          ff* restrict c,
          const int n) {
  #pragma acc data copyin(a[0:n*n], b[0:n*n]) copy(c[0:n*n])
  {
    #pragma acc region
    {
      #pragma acc loop independent vector(16)
      for (int i = 0; i < n; ++i) {
        #pragma acc loop independent vector(16)
        for (int j = 0; j < n; ++j) {
          ff sum = 0;
          for (int k = 0; k < n; ++k) {
            sum += a[i + n * k] * b[k + n * j];
          }
          c[i + n * j] = sum;
        }
      }
    }
  }
}
This code runs well, but I'm looking to optimize it.
I then do a small transformation:
Code:
void mmul(const ff* restrict a,
          const ff* restrict b,
          ff* restrict c,
          const int n) {
  #pragma acc data copyin(a[0:n*n], b[0:n*n]) copy(c[0:n*n])
  {
    #pragma acc region
    {
      for (int is = 0; is < n; is += 32) {
        #pragma acc loop independent
        for (int i = is; i < is+32; ++i) {
          #pragma acc loop independent
          for (int j = 0; j < n; ++j) {
            ff sum = 0;
            for (int k = 0; k < n; ++k) {
              sum += a[i + n * k] * b[k + n * j];
            }
            c[i + n * j] = sum;
          }
        }
      }
    }
  }
}
I simply added an outer for loop; the iteration space remains basically the same.
While this isn't an optimization by itself, the result is very strange: about half the time I run this code I get the following error:
Quote:
call to ctxSynchronize/after/__pgi_cu_uploadx returned error 702: Launch timeout
The other half of the time it simply runs, in about 8 seconds (for 1024x1024 matrices).
For smaller matrices it always works, so I suppose there might be a timeout issue here.
I'm not worried about performance here; I just want to understand this strange behaviour.
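For context, a minimal host-side driver along these lines (hypothetical; the actual test harness isn't shown in the post) would reproduce the 1024x1024 timing experiment described above:

Code:
/* Hypothetical driver: times mmul on 1024x1024 matrices, the size for
   which the intermittent error 702 was observed. Wall-clock time is
   measured with gettimeofday since the accelerator region blocks the
   host until the kernels and copies finish. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

typedef float ff;
void mmul(const ff* restrict a, const ff* restrict b,
          ff* restrict c, const int n);

int main(void) {
  const int n = 1024;
  ff *a = malloc(sizeof(ff) * n * n);
  ff *b = malloc(sizeof(ff) * n * n);
  ff *c = malloc(sizeof(ff) * n * n);
  for (int i = 0; i < n * n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

  struct timeval t0, t1;
  gettimeofday(&t0, NULL);
  mmul(a, b, c, n);
  gettimeofday(&t1, NULL);

  double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1e-6;
  printf("mmul(n=%d): %.2f s, c[0] = %f\n", n, s, c[0]);

  free(a); free(b); free(c);
  return 0;
}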
mkcolg
Joined: 30 Jun 2004 Posts: 4996 Location: The Portland Group Inc.
Posted: Tue Oct 16, 2012 4:23 pm Post subject: |
Hi lechat,
Let's look at the compiler feedback messages for these two loops:
Code:
mmul:
11, Generating copyout(c[0:n*n])
Generating copyin(b[0:n*n])
Generating copyin(a[0:n*n])
14, Generating present_or_copyout(c[0:n*n])
Generating present_or_copyin(a[0:n*n])
Generating present_or_copyin(b[0:n*n])
Generating compute capability 1.0 binary
Generating compute capability 2.0 binary
18, Loop is parallelizable
20, Loop is parallelizable
Accelerator kernel generated
18, #pragma acc loop gang, vector(16) /* blockIdx.x threadIdx.x */
20, #pragma acc loop gang, vector(16) /* blockIdx.y threadIdx.y */
CC 1.0 : 20 registers; 64 shared, 8 constant, 0 local memory bytes
CC 2.0 : 22 registers; 0 shared, 80 constant, 0 local memory bytes
22, Loop is parallelizable
mmul2:
38, Generating copyin(b[0:n*n])
Generating copyin(a[0:n*n])
Generating copy(c[0:n*n])
41, Generating present_or_copy(c[0:n*n])
Generating present_or_copyin(a[0:n*n])
Generating present_or_copyin(b[0:n*n])
Generating compute capability 1.0 binary
Generating compute capability 2.0 binary
44, Complex loop carried dependence of '*(c)' prevents parallelization
Loop carried dependence of '*(c)' prevents parallelization
Loop carried backward dependence of '*(c)' prevents vectorization
Complex loop carried dependence of '*(b)' prevents parallelization
Complex loop carried dependence of '*(a)' prevents parallelization
Accelerator kernel generated
44, CC 1.0 : 20 registers; 64 shared, 8 constant, 0 local memory bytes
CC 2.0 : 22 registers; 0 shared, 80 constant, 0 local memory bytes
46, #pragma acc loop vector(32) /* threadIdx.x */
Loop is parallelizable
48, Loop is parallelizable
50, Loop is parallelizable
main:
89, Generating present_or_copyin(B[0:size][0:size])
Generating present_or_copyin(A[0:size][0:size])
For the first loop, you get a nice 2D gang (grid) with a 2D vector (thread block). However, for the second, because of the loop-carried dependency (the compiler can't prove independence of the computed array indices), only a single gang with a single 1D vector is used. To fix this, add "independent" to the outer loop and add some schedule clauses:
Code:
#pragma acc region
{
  #pragma acc loop independent gang
  for (int is = 0; is < n; is += 32) {
    #pragma acc loop independent vector(32)
    for (int i = is; i < is+32; ++i) {
      #pragma acc loop independent vector(16)
      for (int j = 0; j < n; ++j) {
        ff sum = 0;
        for (int k = 0; k < n; ++k) {
          sum += a[i + n * k] * b[k + n * j];
        }
        c[i + n * j] = sum;
      }
    }
  }
}
Quote:
For smaller matrices it always works, so I suppose there might be a timeout issue here.

Most likely X is killing your run: when a GPU is driving a display, the driver's watchdog aborts kernels that run longer than a few seconds. Is your GTX 280 attached to a monitor?
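If you're not sure, one way to check is to ask the CUDA runtime whether the watchdog is active on each device. A minimal sketch (assuming the CUDA runtime headers and library are available; compile with nvcc or link against libcudart):

Code:
/* Minimal sketch: report whether the kernel-execution watchdog is
   enabled on each CUDA device. A device with the watchdog enabled is
   the one whose long-running kernels get killed (error 702). */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
  int count = 0;
  cudaGetDeviceCount(&count);
  for (int d = 0; d < count; ++d) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, d);
    printf("device %d (%s): watchdog %s\n", d, prop.name,
           prop.kernelExecTimeoutEnabled ? "enabled" : "disabled");
  }
  return 0;
}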
Hope this helps,
Mat
lechat
Joined: 14 Oct 2012 Posts: 10
Posted: Tue Oct 16, 2012 9:27 pm Post subject: |
Yes, it helps. Thanks Mat.
Neldan
Joined: 12 Feb 2013 Posts: 11
Posted: Thu Feb 14, 2013 4:20 am Post subject: |
I have the same problem, but my devices are not attached to a display.
My CUDA devices are:
- GTX 580
- GTX 460
- Tesla C2075
My code:
Code:
#pragma acc data copyin(m1[0:numFilas1][0:numColumnas1], m2[0:numFilas2][0:numColumnas2]), copyout(resultado[0:numFilas1][0:numFilas2])
{
  int i, j;
  #pragma omp parallel for default(shared)
  #pragma acc kernels
  for (i = 0; i < numFilas1; i++)
  {
    #pragma omp parallel for
    #pragma acc loop
    for (j = 0; j < numFilas2; j++)
    {
      int k = 0;
      real_t acumulador = 0;
      for (k = 0; k < numColumnas1; k++)
        acumulador += m1[i][k] * m2[j][k];
      resultado[i][j] = acumulador;
    }
  }
}
My code with your suggested changes:
Code:
#pragma acc data copyin(m1[0:numFilas1][0:numColumnas1], m2[0:numFilas2][0:numColumnas2]), copyout(resultado[0:numFilas1][0:numFilas2])
{
  #pragma acc region
  {
    int i, j;
    #pragma omp parallel for default(shared)
    #pragma acc loop independent
    for (i = 0; i < numFilas1; i++)
    {
      #pragma omp parallel for
      #pragma acc loop independent
      for (j = 0; j < numFilas2; j++)
      {
        int k = 0;
        real_t acumulador = 0;
        for (k = 0; k < numColumnas1; k++)
          acumulador += m1[i][k] * m2[j][k];
        resultado[i][j] = acumulador;
      }
    }
  }
}
On the GTX 580 and GTX 460 the execution fails with the timeout; only on the Tesla device does it seem to finish, in about 40 seconds.
The test data I used were 5000x5000 matrices.
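One thing worth ruling out is which device the runtime actually picks. A minimal sketch of pinning the run to the Tesla with the standard OpenACC device-selection API (the device number 2 is only an assumption; check the ordering that pgaccelinfo reports on your system):

Code:
/* Sketch: select a specific NVIDIA GPU (e.g. the Tesla C2075, which
   has no display watchdog) before any accelerator region runs.
   The device index is hypothetical; verify it with pgaccelinfo. */
#include <openacc.h>

void pick_tesla(void)
{
  acc_set_device_num(2, acc_device_nvidia);  /* assumed device index */
}

The same can be done without recompiling by setting the standard environment variables ACC_DEVICE_TYPE=nvidia and ACC_DEVICE_NUM=<n> before the run.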