The WaveScalar architecture

  • Authors:
  • Mark Oskin; Steven Swanson

  • Affiliations:
  • University of Washington; University of Washington

  • Venue:
  • The WaveScalar architecture
  • Year:
  • 2006

Abstract

Silicon technology will continue to provide an exponential increase in the availability of raw transistors. Effectively translating this resource into application performance, however, is an open challenge that conventional superscalar designs will not be able to meet. We present WaveScalar as a scalable alternative to conventional designs. WaveScalar is a dataflow instruction set and execution model designed for scalable, low-complexity, high-performance processors. Unlike previous dataflow machines, WaveScalar can efficiently provide the sequential memory semantics that imperative languages require. To allow programmers to easily express parallelism, WaveScalar supports pthread-style, coarse-grain multithreading and dataflow-style, fine-grain threading. In addition, it permits blending the two styles within an application or even a single function. To execute WaveScalar programs, we have designed a scalable, tile-based processor architecture called the WaveCache. As a program executes, the WaveCache maps the program's instructions onto its array of processing elements (PEs). The instructions remain at their processing elements for many invocations, and as the working set of instructions changes, the WaveCache removes unused instructions and maps new instructions in their place. The instructions communicate directly with one another over a scalable, hierarchical on-chip interconnect, obviating the need for long wires and broadcast communication. This thesis presents the WaveScalar instruction set and evaluates a simulated implementation based on current technology. For single-threaded applications, the WaveCache achieves performance on par with conventional processors, but in less area. For coarse-grain threaded applications, WaveCache performance scales with chip size over a wide range, and it outperforms a range of multithreaded designs. The WaveCache sustains 7 to 14 multiply-accumulates per cycle on fine-grain threaded versions of well-known kernels. Finally, we apply both styles of threading to an example application, equake from SPEC2000, and speed it up by 9× compared to the serial version.