Compilers and More: The Past, Present and Future of Parallel Loops (Addendum)

In the interests of space, I left out discussion of several parallel loop types. For historical completeness (or at least more completeness), I include them here.

Thread-Centric Parallel Loops

There were several early thread-centric parallel programming models, predecessors to OpenMP. Among them was The Force, developed by Harry Jordan at Colorado using a Fortran preprocessor. The Force was what we now call an SPMD programming model (singel program, multiple data), multiple cores or processors executing the same program. I believe The Force was the first programming model to use the term barrier for a synchronization across all the processes. The processes in a Force program executed code redundantly until they reached an identified parallel loop, where the iterations are spread across the processes.

Cray also used a Fortran preprocessor to deliver what they called microtasking for the Cray X-MP and Cray 2 systems. A subroutine where microtasking is used is identified with a directive and invoked redundantly on each processor. Local variables were private to each processor, while arguments and globals were shared. The iterations of a parallel are spread across the processors. Cray used very fast interprocessor synchronization registers to implement the primitives.

One other notable parallel loop was the Sequent Parallel Processing directives. The Sequent Balance had up to 30 32-bit microprocessors with shared memory. Sequent introduced the doacross directive to identify loops where the iterations should be shared across the processors. It also supported synchronization across loop iterations. The machine was inexpensive enough to be acquired and used by a number of colleges and universities for research and instruction in parallel computing, and the Sequent Guide to Parallel Programming was adopted as a required or recommended text for many courses. The Sequent directives were also adopted by SGI when they developed a multiprocessor workstation.

There were a number of vendor-specific sets of parallel loop directives, provided by IBM, Encore, Alliant, Convex, and others. This mismatch of spellings and implementations led to a customer revolt. Important HPC customers essentially told the vendors to get together and figure out a way to spell parallel do the same way across all machines.

PCF parallel do

In 1987, an effort was initiated to unify the various parallel Fortran extensions. This was known as the Parallel Computing Forum, or PCF. After several years, the group produced a PCF Parallel Fortran Extensions document. Many of the concepts and phrases in this document live on in the OpenMP specification. Like its successor, OpenMP, the iterations of a PCF parallel do loop are divided among the members of a parallel team. PCF Fortran had no explicit support for reductions; instead it had critical sections and locks to allow a loop iteration to accumulate a value.

    parallel do i = 1, n
       row = rowndx(i)
       ncols = rowndx(i+1) - rowndx(i)
       val = 0.0
       do j = 1, ncols
         icol = colndx(row+j)
         val = val + a(row+j) * v(icol)
       enddo
       r(i) = val
    end parallel do

Comments:

True Parallel Loops

doall

A true parallel loop is often called a doall. The original doall was a parallel loop construct defined around 1980 for the proposed Burroughs Flow Model Processor, where each iteration of the parallel loop is data-independent of each other iterations. This allows all iterations to run in parallel, or in any order. A Fortran doall construct might look like:

    doall i = 1:N
       row = rowndx(i)
       ncols = rowndx(i+1) - rowndx(i)
       val = 0.0
       do j = 1, ncols
         icol = colndx(row+j)
         val = val + a(row+j) * v(icol)
       enddo
       r(i) = val
    end doall

Comments:

Myrias pardo

The Myrias SPS-2 parallel computer had a novel implementation of parallel loops using the microprocessor paging system. Each loop iteration was treated as a task, and each task saw a copy of the data as it was at the beginning of the loop. In a sequential loop, if one loop iteration wrote to an array element, and a different loop iteration reads that element, the read will get the original value if the read occurs in an earlier iteration, and will get the updated value if the read occurs in a later iteration. In a parallel loop, this is a data conflict between iterations. The conflict can be resolved in many ways (schedule the two iterations to the same thread, add synchronization) or it can result in a data race (unpredictable results depending on runtime behavior). In the Myrias system, the hardware and software cooperated so such a conflict was always resolved so the read will always get the original value, regardless of the order of iterations. The implementation of the pardo mechanism is fascinating, especially since it used off-the-shelf Motorola 68000 microprocessors and no special hardware. But our focus is on the pardo language construct.

Comments: