Tarantula: a vector extension to the alpha architecture

  • Authors:
  • Roger Espasa; Federico Ardanaz; Joel Emer; Stephen Felix; Julio Gago; Roger Gramunt; Isaac Hernandez; Toni Juan; Geoff Lowney; Matthew Mattina; André Seznec

  • Affiliations:
  • Universitat Politècnica Catalunya, Barcelona, Spain;Universitat Politècnica Catalunya, Barcelona, Spain;Compaq Computer Corporation, Shrewsbury, MA;Compaq Computer Corporation, Shrewsbury, MA;Universitat Politècnica Catalunya, Barcelona, Spain;Universitat Politècnica Catalunya, Barcelona, Spain;Universitat Politècnica Catalunya, Barcelona, Spain;Universitat Politècnica Catalunya, Barcelona, Spain;Compaq Computer Corporation, Shrewsbury, MA;Compaq Computer Corporation, Shrewsbury, MA;Compaq Computer Corporation, Shrewsbury, MA

  • Venue:
  • ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
  • Year:
  • 2002

Abstract

Tarantula is an aggressive floating-point machine targeted at technical, scientific and bioinformatics workloads, originally planned as a follow-on candidate to the EV8 processor [6, 5]. Tarantula adds to the EV8 core a vector unit capable of 32 double-precision flops per cycle. The vector unit fetches data directly from a 16 MByte second-level cache with a peak bandwidth of sixty-four 64-bit values per cycle. The whole chip is backed by a memory controller capable of delivering over 64 GBytes/s of raw bandwidth. Tarantula extends the Alpha ISA with new vector instructions that operate on new architectural state. Salient features of the architecture and implementation are: (1) it fully integrates into a virtual-memory cache-coherent system without changes to its coherency protocol, (2) provides high bandwidth for non-unit stride memory accesses, (3) supports gather/scatter instructions efficiently, (4) fully integrates with the EV8 core with a narrow, streamlined interface, rather than acting as a co-processor, (5) can achieve a peak of 104 operations per cycle, and (6) achieves excellent "real-computation" per transistor and per watt ratios. Our detailed simulations show that Tarantula achieves an average speedup of 5X over EV8, out of a peak speedup in terms of flops of 8X. Furthermore, performance on gather/scatter intensive benchmarks such as Radix Sort is also remarkable: a speedup of almost 3X over EV8 and 15 sustained operations per cycle. Several benchmarks exceed 20 operations per cycle.
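
The peak figures quoted in the abstract can be related to one another with a little arithmetic. The Python sketch below is illustrative only: the constants are taken from the abstract, while the function names and the derived quantities are assumptions introduced here, not figures reported by the authors. In particular, the implied EV8 peak assumes the 8X peak speedup compares the vector unit's 32 flops per cycle with EV8's peak at the same clock frequency, which the abstract does not state.

```python
# Figures quoted in the abstract.
VECTOR_FLOPS_PER_CYCLE = 32   # double-precision flops per cycle (vector unit)
L2_VALUES_PER_CYCLE = 64      # 64-bit values per cycle from the second-level cache
BYTES_PER_VALUE = 8           # one 64-bit value is 8 bytes
PEAK_FLOP_SPEEDUP = 8.0       # peak speedup over EV8 in terms of flops
AVERAGE_SPEEDUP = 5.0         # average simulated speedup over EV8


def l2_bytes_per_cycle():
    """Peak second-level cache bandwidth expressed in bytes per cycle."""
    return L2_VALUES_PER_CYCLE * BYTES_PER_VALUE


def l2_values_per_flop():
    """Peak number of 64-bit cache values available per peak flop."""
    return L2_VALUES_PER_CYCLE / VECTOR_FLOPS_PER_CYCLE


def implied_ev8_flops_per_cycle():
    """EV8 peak flops per cycle implied by the 8X ratio.

    Assumption (not stated in the abstract): the 8X figure is the ratio of
    the vector unit's peak to EV8's peak at equal clock frequency.
    """
    return VECTOR_FLOPS_PER_CYCLE / PEAK_FLOP_SPEEDUP


def fraction_of_peak_speedup():
    """Share of the peak flop speedup realized by the average speedup."""
    return AVERAGE_SPEEDUP / PEAK_FLOP_SPEEDUP


if __name__ == "__main__":
    print("L2 bytes per cycle:", l2_bytes_per_cycle())
    print("L2 values per flop:", l2_values_per_flop())
    print("Implied EV8 flops per cycle:", implied_ev8_flops_per_cycle())
    print("Fraction of peak speedup:", fraction_of_peak_speedup())
```

Under these assumptions, the cache can supply two 64-bit values for every peak flop, and the reported average speedup corresponds to 5/8 of the peak flop speedup.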