A Case Study of Implementing Supernode Transformations

Authors:
Johann Steinbrecher;Cesar J. Philippidis;Weijia Shang
Affiliations:
Computer Engineering Department, Santa Clara University, Santa Clara, USA 95053;Computer Engineering Department, Santa Clara University, Santa Clara, USA 95053;Computer Engineering Department, Santa Clara University, Santa Clara, USA 95053
Venue:
International Journal of Parallel Programming
Year:
2014

Abstract

Supernode transformation is a technique that reduces communication overhead when a loop nest is partitioned and scheduled onto a multiprocessor system. It groups a number of iterations of a perfectly nested loop with regular dependences into a supernode. Previous work has focused on finding the optimal supernode size and shape, as well as an optimal execution schedule, for multiprocessor systems with unbounded resources. This paper focuses on actual implementation strategies for supernode transformations on multi-core systems with limited resources. Using the longest common subsequence (LCS) problem as an example, we present and compare three different multithreading implementations. A formula for the total execution time of each method is presented. The techniques are benchmarked on a 12-core and a 4-core machine. On the 12-core machine, our first technique, which increases data locality, speeds up the unaltered sequential loop nest 16.7 times. Combining this technique with skewing the loop by changing the linear schedule yields a speedup of 42.6. A more sophisticated method that executes entire rows of the loop nest in one thread achieves a speedup of 59.5. The concepts presented and discussed in this paper for the LCS problem serve as a foundation for implementing other algorithms with regular dependences.
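
To make the idea of a supernode concrete, the following is a minimal sequential sketch of the LCS dynamic-programming loop nest, first in its unaltered form and then with iterations grouped into rectangular supernodes that are visited wavefront by wavefront. It is not one of the three multithreaded implementations the paper compares. The function names, the rectangular tile shape, and the default tile sizes are assumptions made for illustration only.

```python
def lcs_length(a, b):
    """Unaltered sequential loop nest for the LCS length."""
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]


def lcs_length_supernodes(a, b, tile_rows=4, tile_cols=4):
    """Same computation, with iterations grouped into rectangular supernodes.

    Supernode (ti, tj) depends only on (ti-1, tj), (ti, tj-1) and (ti-1, tj-1),
    so all supernodes with the same ti + tj (one wavefront) are independent.
    Here they are executed one after another; a parallel version could hand
    each supernode of a wavefront to a different thread (assumption).
    """
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    tiles_i = (n + tile_rows - 1) // tile_rows  # supernodes along i
    tiles_j = (m + tile_cols - 1) // tile_cols  # supernodes along j

    for wave in range(tiles_i + tiles_j - 1):
        first_ti = max(0, wave - tiles_j + 1)
        last_ti = min(tiles_i, wave + 1)
        for ti in range(first_ti, last_ti):
            tj = wave - ti
            i0 = ti * tile_rows + 1
            i1 = min(i0 + tile_rows, n + 1)
            j0 = tj * tile_cols + 1
            j1 = min(j0 + tile_cols, m + 1)
            # Iterations inside one supernode keep their original order.
            for i in range(i0, i1):
                for j in range(j0, j1):
                    if a[i - 1] == b[j - 1]:
                        table[i][j] = table[i - 1][j - 1] + 1
                    else:
                        table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]
```

Both functions return the same LCS length for any tile size; only the order in which the iterations are visited differs. Choosing the supernode size and the schedule is the trade-off the abstract describes.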