This paper describes Cronus, a platform for parallelizing general nested loops. General nested loops contain complex loop bodies (assignments, conditionals, repetitions) and exhibit uniform loop-carried dependencies. The novelty of Cronus is twofold: (1) it determines the optimal scheduling hyperplane using the QuickHull algorithm, which is more efficient than previously used methods, and (2) it implements a simple and efficient dynamic rule (successive dynamic scheduling) for the runtime scheduling of the loop iterations along the optimal hyperplane. This scheduling policy enhances data locality and improves the makespan. Cronus provides an efficient runtime library, specifically designed to minimize communication, that performs better than more generic systems such as Berkeley UPC. Its performance was evaluated through extensive testing on three representative case studies: the Floyd-Steinberg dithering algorithm, the Transitive Closure algorithm, and the FSBM motion estimation algorithm. The experimental results corroborate the efficiency of the parallel code, showing speedups ranging from 1.18 (of an ideal 4) to 12.29 (of an ideal 16) on distributed-memory systems, and from 3.60 (of 4) to 15.79 (of 16) on shared-memory systems. Cronus outperforms UPC by 5-95%, depending on the test case.
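To illustrate the scheduling idea the abstract describes, the following is a minimal sketch (not Cronus's actual implementation) of hyperplane scheduling for a loop with uniform dependencies: a vector pi is a legal linear schedule when pi . d >= 1 for every dependence vector d, and all iterations lying on the same hyperplane pi . j = t are independent and can run in parallel. The dependence vectors below are those of Floyd-Steinberg error diffusion, one of the paper's case studies; all function names are illustrative.

```python
def valid_schedule(pi, deps):
    """A hyperplane pi is a legal linear schedule iff pi . d >= 1
    for every uniform dependence vector d."""
    return all(pi[0] * d[0] + pi[1] * d[1] >= 1 for d in deps)

def wavefronts(pi, n, m):
    """Group an n x m iteration space into hyperplanes pi . (i, j) = t.
    Iterations on the same hyperplane carry no dependence between them,
    so each wavefront can execute in parallel; wavefronts run in order of t."""
    fronts = {}
    for i in range(n):
        for j in range(m):
            fronts.setdefault(pi[0] * i + pi[1] * j, []).append((i, j))
    return [fronts[t] for t in sorted(fronts)]

# Floyd-Steinberg error-diffusion dependence vectors: pixel (i, j) depends
# on (i, j-1), (i-1, j), (i-1, j-1), and (i-1, j+1).
deps = [(0, 1), (1, 0), (1, 1), (1, -1)]

pi = (2, 1)  # a legal hyperplane for these dependencies
assert valid_schedule(pi, deps)

fronts = wavefronts(pi, 4, 4)
# The makespan is the number of hyperplanes: pi . (n-1, m-1) + 1 = 10 here.
```

In the paper's setting, QuickHull speeds up finding the makespan-minimal pi (only dependence vectors on the convex hull constrain the optimum), while successive dynamic scheduling assigns iterations of each wavefront to processors at runtime; the sketch above only shows the legality test and the wavefront partition.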