PGI User Forum

win32 api issue and unified binary question
vam



Joined: 11 Dec 2012
Posts: 4

Posted: Wed May 29, 2013 2:25 pm    Post subject: win32 api issue and unified binary question

Hello Mat,

I have an issue with the code below and a few questions about the unified binary technology.

- The code below compiles and runs when no optimisation is enabled. However, when -O or higher is enabled, the program hangs on startup. I believe it is related to the 'StringCchPrintf' function (when it is commented out, there are no problems). _stprintf has the same issue. Is there a way to solve it?

- How many simultaneous targets are supported with the -tp compiler option? I have the impression that when a lot of targets are specified, no specific code is generated for some of them.

- How does the 'pragma routine tp' directive work? In the example below, for the 'smooth' function, it is set to "#pragma routine tp p7 k8 core2 sandybridge" in an effort to generate target-specific routine code for four different targets. However, the compiler output only reports generated code for penryn (which wasn't even specified). Can you explain what I'm doing wrong? I am using PGI 13.4 on Windows.

Thanks,

Ruben

Code:

pgcpp main.cpp kernel32.lib user32.lib gdi32.lib -Minfo -fastsse


Code:

        5362, PGI Unified Binary version for -tp=penryn-64
WinMain:
     51, Loop not vectorized/parallelized: contains call
smooth__FPfT1fN23iN26:
    109, Loop not vectorized: data dependency
         Loop unrolled 2 times



Code:

#include <windows.h>
#include <Strsafe.h>
#include <math.h>

#ifndef WINVER
#define WINVER 0x0502
#endif

// Declare the main WndProc prototype
LRESULT CALLBACK Main_WndProc(HWND, UINT, WPARAM, LPARAM);

//
// Define the WinMain
//
int WINAPI WinMain(HINSTANCE hInstance, HINSTANCE hPrevInstance, PSTR szCmdLine, int iCmdShow)
{
   static TCHAR szWndClassName[] = TEXT("WinApp1");   // Window classname to be used
   HWND hwnd;                                 // handle to main window
   MSG msg;                                 // message from message queue
   WNDCLASS wndclass;                            // our main window class
   
   // Define window class
   wndclass.style         = CS_HREDRAW | CS_VREDRAW;      // redraw (WM_PAINT)on H or V resize
   wndclass.lpfnWndProc    = Main_WndProc;
   wndclass.cbClsExtra      = 0;
   wndclass.cbWndExtra       = 0 ;
   wndclass.hIcon            = LoadIcon (NULL, IDI_APPLICATION) ;
   wndclass.hCursor          = LoadCursor (NULL, IDC_ARROW) ;
   wndclass.hbrBackground    = (HBRUSH) GetStockObject (WHITE_BRUSH) ;
   wndclass.lpszMenuName     = NULL ;
   wndclass.lpszClassName    = szWndClassName;
   // register windowclass
   if (!RegisterClass (&wndclass))
   {
      MessageBox (NULL, TEXT ("Program requires Windows NT!"), szWndClassName, MB_ICONERROR) ;
          return 0 ;
   }
   
   // Create the window
   hwnd = CreateWindow(szWndClassName, TEXT("Test WinApp1"),             // LPCTSTR lpClassName, LPCTSTR lpWindowName,
                  WS_OVERLAPPEDWINDOW | WS_VSCROLL | WS_HSCROLL,       // DWORD dwStyle,
                  CW_USEDEFAULT, CW_USEDEFAULT,                  // int x, int y,
                  CW_USEDEFAULT, CW_USEDEFAULT,                  // int width, int height,
                  NULL, NULL, hInstance, NULL);                  // HWND hWndparent, HMENU hMenu, HINSTANCE hInstance, LPVOID lpParam
                  
   // Show and repaint window immediately (WM_PAINT)
   ShowWindow(hwnd, iCmdShow);
   UpdateWindow(hwnd);
   
   // Start message pump
   while (GetMessage(&msg, NULL, 0, 0) > 0)      // LPMSG lpMsg, HWND hWnd, UINT wMsgFilterMin, UINT wMsgFilterMax
   {
      TranslateMessage(&msg);
      DispatchMessage(&msg);
   }

   return msg.wParam;   // return Quit code?
}

//
// Main  Window Procedure
//
LRESULT CALLBACK Main_WndProc (HWND hwnd, UINT msg, WPARAM wParam, LPARAM lParam)
{
   HDC hdc;   
   PAINTSTRUCT ps;   
   float sTemp = 3.1415;
   TCHAR szBuffer[20]=TEXT("Empty");   
   
   switch(msg)
   {   
   case WM_CREATE:
      // Get device context
      hdc = GetDC (hwnd);      
      
      ReleaseDC (hwnd, hdc);
      return 0;
      
   case WM_PAINT:      
      // Get device context
      hdc = BeginPaint (hwnd, &ps) ;
      
      // Problematic function
      StringCchPrintf(szBuffer, 20, TEXT("%f"), sTemp);
      
      // Paint text
      TextOut(hdc, 0, 0, szBuffer, lstrlen(szBuffer));

      EndPaint (hwnd, &ps) ;
      return 0 ;       

   case WM_DESTROY:
      PostQuitMessage (0) ;
      return 0 ;

   // Default handler
   default: return DefWindowProc (hwnd, msg, wParam, lParam);
   }
}

#pragma routine tp p7 k8 core2 sandybridge
void smooth( float* a, float* b, float w0, float w1, float w2, int n, int m, int niters )
{
    int i, j, iter;
    float* tmp;
    for( iter = 1; iter <= niters; ++iter ){
        #pragma acc kernels loop copyin(b[0:n*m]) copy(a[0:n*m]) independent
        for( i = 1; i < n-1; ++i )
            for( j = 1; j < m-1; ++j )
                a[i*m+j] = w0 * b[i*m+j] +
                    w1*(b[(i-1)*m+j] + b[(i+1)*m+j] + b[i*m+j-1] + b[i*m+j+1]) +
                    w2*(b[(i-1)*m+j-1] + b[(i-1)*m+j+1] + b[(i+1)*m+j-1] + b[(i+1)*m+j+1]);
        tmp = a;  a = b;  b = tmp;
    }
}
mkcolg



Joined: 30 Jun 2004
Posts: 6129
Location: The Portland Group Inc.

Posted: Thu May 30, 2013 11:29 am

Hi vam,

I had forgotten that we even had a "tp" pragma, and from what I can tell our engineers had too. It doesn't appear to have been updated in many years, so it only contains older targets. I'll submit a bug report, but since you're the first person to use this feature, my guess is we will just remove it.

Instead, use the "-tp" command-line switch with multiple 64-bit targets to create a Unified Binary (32-bit targets are not supported). You can list as many targets as you wish.

Code:
$ pgcpp main.cpp kernel32.lib user32.lib gdi32.lib -Minfo -fastsse -tp=sandybridge-64,bulldozer-64,core2-64
StringCchPrintfA__FPcULPCce:
   5365, [local to main_cpp]::StringValidateDestA(const char *, unsigned long long, unsigned long long) inlined, size=4 (inline) file main.cpp (10220)
   5373, [local to main_cpp]::StringVPrintfWorkerA(char *, unsigned long long, unsigned long long *, const char *, char *) inlined, size=15 (inline) file main.cpp (10572)
WinMain:
     16, PGI Unified Binary version for -tp=sandybridge-64
     51, Loop not vectorized/parallelized: contains call
WinMain:
     16, PGI Unified Binary version for -tp=bulldozer-64
     51, Loop not vectorized/parallelized: contains call
WinMain:
     16, PGI Unified Binary version for -tp=core2-64
     51, Loop not vectorized/parallelized: contains call
Main_WndProc__FP6HWND__UiULL:
     64, PGI Unified Binary version for -tp=sandybridge-64
Main_WndProc__FP6HWND__UiULL:
     64, PGI Unified Binary version for -tp=bulldozer-64
Main_WndProc__FP6HWND__UiULL:
     64, PGI Unified Binary version for -tp=core2-64
smooth__FPfT1fN23iN26:
    103, PGI Unified Binary version for -tp=sandybridge-64
    109, Loop not vectorized: data dependency
         Loop unrolled 2 times
smooth__FPfT1fN23iN26:
    103, PGI Unified Binary version for -tp=bulldozer-64
    109, Loop not vectorized: data dependency
         Loop unrolled 2 times
smooth__FPfT1fN23iN26:
    103, PGI Unified Binary version for -tp=core2-64
    109, Loop not vectorized: data dependency
         Loop unrolled 2 times
main.cpp:


- Mat
vam



Joined: 11 Dec 2012
Posts: 4

Posted: Fri May 31, 2013 12:33 am

Hi Mat,

thank you for your reply.

- Would you know why the "StringCchPrintf" might cause hanging on application startup? It's part of the WIN32 api (a safer replacement for _stprintf which seems to have the same problem). If you don't enable optimization, the code runs and shows a simple window. If optimization is enabled, no window is shown and the process remains in task manager.

- How does the unified binary dispatch code work? Do you check for a certain vendor string or processor architecture, or is it based on the supported feature set of the processor? If a code is optimized for piledriver, sandybridge and generic x64, what path will, for instance, an Ivybridge or future processor take? Will it use AVX if available?

IMHO your unified binary technology is one of the strongest points of the compiler, since it's able to specifically optimize for both Intel and AMD (whereas the Intel compiler optimizes for Intel only and uses a less-optimized path for non-Intel chips). AFAIK, they check for their own vendor string and then select a code path based on the supported instruction set. Processors with other vendor strings just run the slower path.
I'd like to understand how the PGI compiler selects its code path and if there are generic recommendations/remarks to make your applications run as well as possible on future architectures.

Thank you.
Best regards,

Ruben
mkcolg



Joined: 30 Jun 2004
Posts: 6129
Location: The Portland Group Inc.

Posted: Fri May 31, 2013 11:27 am

Hi Ruben,

Quote:
Would you know why the "StringCchPrintf" might cause hanging on application startup? It's part of the WIN32 api (a safer replacement for _stprintf which seems to have the same problem). If you don't enable optimization, the code runs and shows a simple window. If optimization is enabled, no window is shown and the process remains in task manager.
Not off hand and your program seems to work for me at high opt.

Are you sure it's hanging on "StringCchPrintf"? One thing that comes to mind is that early releases of Win7 didn't support AVX instructions, which would cause a program to hang on start-up. To test this, compile with optimization targeting Penryn (i.e. -fast -tp=penryn-64). If that build runs, the solution is to install Win7 SP1.
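
To illustrate the AVX point: a processor can report AVX while the operating system has not enabled saving of the YMM registers, which is the situation on Windows 7 before SP1. Below is a minimal sketch of the usual check. It assumes the caller has already read ECX from CPUID leaf 1 and XCR0 via XGETBV (for example with compiler intrinsics). The function and constant names are made up for illustration, and this is not PGI's runtime code.

```cpp
#include <cstdint>

// CPUID leaf 1, ECX register bit positions.
constexpr std::uint32_t kCpuidOsxsave = 1u << 27; // OS has enabled XSAVE/XGETBV
constexpr std::uint32_t kCpuidAvx     = 1u << 28; // CPU implements AVX

// XCR0 bits: bit 1 = SSE (XMM) state, bit 2 = AVX (upper YMM) state.
constexpr std::uint64_t kXcr0SseAvx = 0x6;

// Returns true only when the CPU implements AVX *and* the OS saves YMM state.
// cpuid1_ecx: ECX from CPUID leaf 1.
// xcr0:       result of XGETBV with ECX = 0. XGETBV may only be executed
//             when the OSXSAVE bit is set, so pass 0 when it is clear.
bool avx_usable(std::uint32_t cpuid1_ecx, std::uint64_t xcr0)
{
    if ((cpuid1_ecx & kCpuidOsxsave) == 0) return false; // OS does not use XSAVE
    if ((cpuid1_ecx & kCpuidAvx) == 0) return false;     // no AVX in hardware
    return (xcr0 & kXcr0SseAvx) == kXcr0SseAvx;          // XMM and YMM both enabled
}
```

A binary that skips the XCR0 part of this test and executes AVX code on such a system will fault, which is why building for penryn-64 (no AVX) is a quick way to rule this cause in or out.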

Quote:
How does the unified binary dispatch code work? Do you check for a certain vendor string or processor architecture, or is it based on the supported feature set of the processor? If a code is optimized for piledriver, sandybridge and generic x64, what path will, for instance, an Ivybridge or future processor take? Will it use AVX if available?
At start-up the runtime checks the feature set of the processor and then selects the appropriate code path. So yes, Ivybridge would use AVX. You can use the utility "pgcpuid" to see the feature list for your processor.
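
As a rough illustration of feature-based (rather than vendor-based) selection, here is a minimal sketch. It is not the PGI runtime. The structs, the function names, the reduced feature list, and the assumption that the Piledriver version requires FMA4 are all hypothetical, chosen only to show why an Ivybridge processor would land on the AVX path in the three-target example from the question.

```cpp
#include <string>
#include <vector>

// Hypothetical feature flags; a real runtime would fill these from CPUID.
struct CpuFeatures {
    bool avx  = false;
    bool fma4 = false;
};

// One compiled version of a routine and the features it requires.
struct CodePath {
    std::string target;
    bool needs_avx;
    bool needs_fma4;
};

// Returns the first path whose required features are all present.
// Paths must be ordered from most to least demanding, ending with a baseline.
std::string select_path(const std::vector<CodePath>& paths, const CpuFeatures& cpu)
{
    for (const CodePath& p : paths) {
        if (p.needs_avx && !cpu.avx) continue;
        if (p.needs_fma4 && !cpu.fma4) continue;
        return p.target;
    }
    return "none"; // only reached if no baseline path was listed
}

// The three targets from the question, in priority order (labels are illustrative).
std::vector<CodePath> example_paths()
{
    return {
        {"piledriver",  true,  true },  // assumed here to use AVX and FMA4
        {"sandybridge", true,  false},  // AVX only
        {"generic x64", false, false},  // baseline, runs everywhere
    };
}
```

With these assumptions, a processor that reports AVX but not FMA4 (such as Ivybridge) selects the "sandybridge" version, one that reports neither falls back to "generic x64", and the vendor string never enters the decision.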

Quote:
IMHO your unified binary technology is one of the strongest points of the compiler, since it's able to specifically optimize for both Intel and AMD
Thank you. It's one of the advantages of being independent. We don't play favorites and are more interested in getting the fastest performance across all targets.

Quote:
I'd like to understand how the PGI compiler selects its code path and if there are generic recommendations/remarks to make your applications run as well as possible on future architectures.
When we first implemented Unified Binary, there was a wide performance difference between running a binary on the architecture it was built for and running it on another. For example, running an Intel-targeted binary on AMD hardware would slow the code by 20% versus a natively targeted binary. As the x86-64 architectures have matured, though, this has become less of a problem. Now the main issue is building portable binaries that can take advantage of new instructions (like AVX) on systems that support them, but still run on older processors.

- Mat
vam



Joined: 11 Dec 2012
Posts: 4

Posted: Tue Jun 04, 2013 1:24 pm

Hi Mat,

thanks for your answers.

Quote:

Not off hand and your program seems to work for me at high opt.

Just to verify, did you compile using pgcc or pgcpp? If I remember correctly, pgcc also ran OK for me, but the hang occurred with the C++ compiler.

Thank you.
Best regards,

Ruben
All times are GMT - 7 Hours
Page 1 of 2

 