Breaking SIMD shackles with an exposed flexible microarchitecture and the access execute PDG

  • Authors:
  • Venkatraman Govindaraju;Tony Nowatzki;Karthikeyan Sankaralingam

  • Affiliations:
  • University of Wisconsin-Madison, Madison, WI, USA;University of Wisconsin-Madison, Madison, WI, USA;University of Wisconsin-Madison, Madison, WI, USA

  • Venue:
  • PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
  • Year:
  • 2013

Quantified Score

Hi-index 0.00

Visualization

Abstract

Modern microprocessors exploit data level parallelism through in-core data-parallel accelerators in the form of short vector ISA extensions such as SSE/AVX and NEON. Although these ISA extensions have existed for decades, compilers do not generate good quality, high-performance vectorized code without significant programmer intervention and manual optimization. The fundamental problem is that the architecture is too rigid, which overly complicates the compiler's role and simultaneously restricts the types of codes that the compiler can profitably map to these data-parallel accelerators. We take a fundamentally new approach that first makes the architecture more flexible and exposes this flexibility to the compiler. Counter-intuitively, increasing the complexity of the accelerator's interface to the compiler enables a more robust and efficient system that supports many types of codes. This system also enables the performance of auto-acceleration to be comparable to that of manually-optimized implementations. To address the challenges of compiling for flexible accelerators, we propose a variant of Program Dependence Graph called the Access Execute Program Dependence Graph to capture spatio-temporal aspects of memory accesses and computations. We implement a compiler that uses this representation and evaluate it by considering both a suite of kernels developed and tuned for SSE, and "challenge" data-parallel applications, the Parboil benchmarks. We show that our compiler, which targets the DySER accelerator, provides high-quality code for the kernels and full applications, commonly reaching within 30% of manually-optimized and out-performs compiler-produced SSE code by 1.8 times.