Global Multi-Threaded Instruction Scheduling

  • Authors:
  • Guilherme Ottoni, David August

  • Venue:
  • Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
  • Year:
  • 2007

Abstract

Recently, the microprocessor industry has moved toward chip multiprocessor (CMP) designs as a means of utilizing the increasing transistor counts in the face of physical and micro-architectural limitations. Despite this move, CMPs do not directly improve the performance of single-threaded codes, a characteristic of most applications. In order to support parallelization of general-purpose applications, computer architects have proposed CMPs with lightweight scalar communication mechanisms [21, 23, 29]. Despite such support, most existing compiler multi-threading techniques have generally demonstrated little effectiveness in extracting parallelism from non-scientific applications [14, 15, 22]. The main reason for this is that such techniques are mostly restricted to extracting parallelism within straight-line regions of code. In this paper, we first propose a framework that enables global multi-threaded instruction scheduling in general. We then describe GREMIO, a scheduler built using this framework. GREMIO operates at a global scope, at the procedure level, and uses control dependence analysis to extract non-speculative thread-level parallelism from sequential codes. Using a fully automatic compiler implementation of GREMIO and a validated processor model, this paper demonstrates gains for a dual-core CMP model running a variety of codes. Our experiments demonstrate the advantage of exploiting global scheduling for multi-threaded architectures, and present gains in a detailed comparison with the Decoupled Software Pipelining (DSWP) multi-threading technique [18]. Furthermore, our experiments show that adding GREMIO to a compiler with DSWP improves the average speedup from 16.5% to 32.8% for important benchmark functions when utilizing two cores, indicating the importance of this technique in making compilers extract threads effectively.