Communication optimizations for global multi-threaded instruction scheduling
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Recently, the microprocessor industry has moved toward chip multiprocessor (CMP) designs as a means of utilizing the increasing transistor counts in the face of physical and micro-architectural limitations. Despite this move, CMPs do not directly improve the performance of single-threaded codes, a characteristic of most applications. In order to support parallelization of general-purpose applications, computer architects have proposed CMPs with lightweight scalar communication mechanisms [21, 23, 29]. Despite such support, most existing compiler multi-threading techniques have generally demonstrated little effectiveness in extracting parallelism from non-scientific applications [14, 15, 22]. The main reason for this is that such techniques are mostly restricted to extracting parallelism within straight-line regions of code.

In this paper, we first propose a framework that enables global multi-threaded instruction scheduling in general. We then describe GREMIO, a scheduler built using this framework. GREMIO operates at a global scope, at the procedure level, and uses control dependence analysis to extract non-speculative thread-level parallelism from sequential codes. Using a fully automatic compiler implementation of GREMIO and a validated processor model, this paper demonstrates gains for a dual-core CMP model running a variety of codes. Our experiments demonstrate the advantage of exploiting global scheduling for multi-threaded architectures, and present gains in a detailed comparison with the Decoupled Software Pipelining (DSWP) multi-threading technique [18]. Furthermore, our experiments show that adding GREMIO to a compiler with DSWP improves the average speedup from 16.5% to 32.8% for important benchmark functions when utilizing two cores, indicating the importance of this technique in making compilers extract threads effectively.
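To make the DSWP comparison above concrete, the following is a minimal Python sketch of the idea behind decoupled software pipelining: a sequential loop is split into pipeline stages running on separate threads, with a one-way queue standing in for the lightweight inter-core scalar communication channels the paper assumes in hardware. All function and variable names here are illustrative, not from the paper, and real DSWP operates on compiler IR rather than source-level threads.

```python
import threading
import queue

def dswp_sum_of_squares(values):
    """Sketch of a two-stage decoupled software pipeline.

    Stage 1 (the "traversal" thread) walks the input sequence and
    forwards each element; stage 2 (the "compute" thread) consumes
    the elements and accumulates the heavier work. Because values
    only flow forward through the queue, the two stages overlap in
    time instead of executing strictly in sequence.
    """
    chan = queue.Queue()          # stands in for an inter-core scalar queue
    SENTINEL = object()           # marks end-of-stream
    result = []

    def stage1_traverse():
        # Produce: walk the "list" and send each value downstream.
        for v in values:
            chan.put(v)
        chan.put(SENTINEL)

    def stage2_compute():
        # Consume: perform the per-element work and accumulate.
        total = 0
        while True:
            v = chan.get()
            if v is SENTINEL:
                break
            total += v * v
        result.append(total)

    t1 = threading.Thread(target=stage1_traverse)
    t2 = threading.Thread(target=stage2_compute)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return result[0]
```

The key property being illustrated is that communication is unidirectional (stage 1 never waits on stage 2's results), which is what lets the pipeline tolerate latency between cores; GREMIO, by contrast, partitions instructions using control dependence rather than loop pipeline stages.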