Dark silicon and the end of multicore scaling

  • Authors:
  • Hadi Esmaeilzadeh;Emily Blem;Renee St. Amant;Karthikeyan Sankaralingam;Doug Burger

  • Affiliations:
  • University of Washington, Seattle, WA, USA;University of Wisconsin-Madison, Madison, WI, USA;The University of Texas at Austin, Austin, TX, USA;University of Wisconsin-Madison, Madison, WI, USA;Microsoft Research, Seattle, WA, USA

  • Venue:
  • Proceedings of the 38th annual international symposium on Computer architecture
  • Year:
  • 2011

Quantified Score

Hi-index 0.03

Visualization

Abstract

Since 2005, processor designers have increased core counts to exploit Moore's Law scaling, rather than focusing on single-core performance. The failure of Dennard scaling, to which the shift to multicore parts is partially a response, may soon limit multicore scaling just as single-core scaling has been curtailed. This paper models multicore scaling limits by combining device scaling, single-core scaling, and multicore scaling to measure the speedup potential for a set of parallel workloads for the next five technology generations. For device scaling, we use both the ITRS projections and a set of more conservative device scaling parameters. To model single-core scaling, we combine measurements from over 150 processors to derive Pareto-optimal frontiers for area/performance and power/performance. Finally, to model multicore scaling, we build a detailed performance model of upper-bound performance and lower-bound core power. The multicore designs we study include single-threaded CPU-like and massively threaded GPU-like multicore chip organizations with symmetric, asymmetric, dynamic, and composed topologies. The study shows that regardless of chip organization and topology, multicore scaling is power limited to a degree not widely appreciated by the computing community. Even at 22 nm (just one year from now), 21% of a fixed-size chip must be powered off, and at 8 nm, this number grows to more than 50%. Through 2024, only 7.9x average speedup is possible across commonly used parallel workloads, leaving a nearly 24-fold gap from a target of doubled performance per generation.