Porting existing cache-oblivious linear algebra HPC modules to Larrabee architecture

Authors:
Alexander Heinecke;Carsten Trinitis;Josef Weidendorfer
Affiliations:
Technische Universität München, München, Germany;Technische Universität München, München, Germany;Technische Universität München, München, Germany
Venue:
Proceedings of the 7th ACM international conference on Computing frontiers
Year:
2010

Citing 7
Cited 0

A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software (TOMS)

Nonlinear array layouts for hierarchical memory systems. ICS '99 Proceedings of the 13th international conference on Supercomputing

Larrabee: a many-core x86 architecture for visual computing. ACM SIGGRAPH 2008 papers

Parallel matrix multiplication based on space-filling curves on shared memory multicore platforms. Proceedings of the 2008 workshop on Memory access on future processors: a solved problem?

Towards many-core implementation of LU decomposition using Peano Curves. Proceedings of the combined workshops on UnConventional high performance computing workshop plus memory access workshop

Cache oblivious matrix operations using Peano curves. PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing

Hardware-oriented implementation of cache oblivious matrix operations based on space-filling curves. PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics

Abstract

Cache-obliviousness is an important but relatively new concept in cache optimization. Because cache-oblivious algorithms perform well on architectures with arbitrary cache configurations, the programming effort required to port and optimize code for future architectures can be significantly reduced. In [8] and [9], fast parallel cache-oblivious linear algebra modules were presented. The underlying matrix storage schemes are based on space-filling curves. For matrix multiplication, all cache misses can be avoided, whereas for the LU decomposition algorithm the number of cache misses is minimized. The resulting codes have been shown to work very well on several kinds of systems, ranging from laptops to supercomputers. In this paper, we show that the runtime characteristics of our existing cache-oblivious codes can be preserved on newer Intel processors. Special emphasis is put on the first many-core processor architecture with complete hardware-based cache coherency: the Larrabee architecture. As the latter is expected to be available as a PCIe card connected to the host system, porting had to take into account the transfer of data structures between different memory address spaces. Unfortunately, Larrabee was canceled as a graphics device for 2010, but Intel is expected to outline further steps for Larrabee during 2010.
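
To illustrate how a matrix layout based on a space-filling curve can support a cache-oblivious recursion, the following is a minimal sketch and not the authors' implementation. It assumes Morton (Z-order) indexing, which is simpler than the Peano curves used in the cited work, and it assumes square matrices whose size is a power of two. All function names (`morton_index`, `to_morton`, `from_morton`, `multiply_add`) are hypothetical and chosen only for this example.

```python
def morton_index(row, col):
    """Interleave the bits of row and col (row bits land in the odd positions)."""
    index = 0
    bit = 0
    while (row >> bit) or (col >> bit):
        index |= ((col >> bit) & 1) << (2 * bit)
        index |= ((row >> bit) & 1) << (2 * bit + 1)
        bit += 1
    return index


def to_morton(matrix):
    """Copy a square row-major matrix (list of lists) into a flat Z-order list."""
    n = len(matrix)
    flat = [0] * (n * n)
    for r in range(n):
        for c in range(n):
            flat[morton_index(r, c)] = matrix[r][c]
    return flat


def from_morton(flat, n):
    """Rebuild an n x n list-of-lists matrix from a flat Z-order list."""
    return [[flat[morton_index(r, c)] for c in range(n)] for r in range(n)]


def multiply_add(a, b, c, a_off, b_off, c_off, n):
    """Compute C += A * B for n x n blocks stored contiguously in Z-order.

    Assumption: n is a power of two. Each quadrant (i, j) of a block starts
    at offset (2 * i + j) * (n // 2) ** 2 within that block, so the recursion
    only ever touches contiguous ranges and needs no cache-size parameter.
    """
    if n == 1:
        c[c_off] += a[a_off] * b[b_off]
        return
    half = n // 2
    quad = half * half  # number of elements per quadrant
    for i in range(2):
        for j in range(2):
            for k in range(2):
                multiply_add(
                    a, b, c,
                    a_off + (2 * i + k) * quad,
                    b_off + (2 * k + j) * quad,
                    c_off + (2 * i + j) * quad,
                    half,
                )


# Usage: multiply two 2 x 2 matrices held in Z-order storage.
a = to_morton([[1, 2], [3, 4]])
b = to_morton([[5, 6], [7, 8]])
c = [0] * 4
multiply_add(a, b, c, 0, 0, 0, 2)
result = from_morton(c, 2)  # [[19, 22], [43, 50]]
```

The recursion halves the block size until single elements remain. A practical code would stop at a larger base block and use a tuned kernel there; that cutoff is omitted here to keep the sketch short.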