The program dependence graph and its use in optimization
ACM Transactions on Programming Languages and Systems (TOPLAS)
Introduction to algorithms
Limits of control flow on parallelism
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
PLDI '92 Proceedings of the ACM SIGPLAN 1992 conference on Programming language design and implementation
Partitioned register files for VLIWs: a preliminary analysis of tradeoffs
MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Communication optimization and code generation for distributed memory machines
PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
Static branch frequency and program profile analysis
MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Global communication analysis and optimization
PLDI '96 Proceedings of the ACM SIGPLAN 1996 conference on Programming language design and implementation
Effective cluster assignment for modulo scheduling
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Space-time scheduling of instruction-level parallelism on a raw machine
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
A global communication optimization technique based on data-flow analysis and linear algebra
ACM Transactions on Programming Languages and Systems (TOPLAS)
Modulo scheduling with integrated register spilling for clustered VLIW architectures
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Computers and Intractability: A Guide to the Theory of NP-Completeness
Computers and Intractability: A Guide to the Theory of NP-Completeness
Compiler optimization of scalar value communication between speculative threads
Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
A Concurrent Execution Semantics for Parallel Program Graphs and Program Dependence Graphs
Proceedings of the 5th International Workshop on Languages and Compilers for Parallel Computing
Master/slave speculative parallelization
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Field-testing IMPACT EPIC research results in Itanium 2
Proceedings of the 31st annual international symposium on Computer architecture
Decoupled Software Pipelining with the Synchronization Array
Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
IEEE Transactions on Parallel and Distributed Systems
Automatic Thread Extraction with Decoupled Software Pipelining
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Chip multi-processor scalability for single-threaded applications
ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
A framework for unrestricted whole-program optimization
Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
Support for High-Frequency Streaming in CMPs
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Speculative Decoupled Software Pipelining
PACT '07 Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques
Global Multi-Threaded Instruction Scheduling
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
A profile-based tool for finding pipeline parallelism in sequential programs
Parallel Computing
An automatic thread decomposition approach for pipelined multithreading
International Journal of High Performance Computing and Networking
Hi-index | 0.00 |
The recent shift in the industry towards chip multiprocessor (CMP) designs has brought the need for multi-threaded applications to mainstream computing. As observed in several limit studies, most of the parallelization opportunities require looking for parallelism beyond local regions of code. To exploit these opportunities, especially for sequential applications, researchers have recently proposed global multi-threaded instruction scheduling techniques, including DSWP and GREMIO. These techniques simultaneously schedule instructions from large regions of code, such as arbitrary loop nests or whole procedures, and have been shown to be effective at extracting threads for many applications. A key enabler of these global instruction scheduling techniques is the Multi-Threaded Code Generation (MTCG) algorithm proposed in [16], which generates multi-threaded code for any partition of the instructions into threads. This algorithm inserts communication and synchronization instructions in order to satisfy all inter-thread dependences. In this paper, we present a general compiler framework, COCO, to optimize the communication and synchronization instructions inserted by the MTCG algorithm. This framework, based on thread-aware data-flow analyses and graph min-cut algorithms, appropriately models andoptimizes all kinds of inter-thread dependences, including register, memory, and control dependences. Our experiments, using a fully automatic compiler implementation of these techniques, demonstrate significant reductions (about 30% on average) in the number of dynamic communication instructions in code parallelized with DSWP and GREMIO. This reduction in communication translates to performance gains of up to 40%.