A set of level 3 basic linear algebra subprograms
ACM Transactions on Mathematical Software (TOMS)
Nonlinear array layouts for hierarchical memory systems
ICS '99 Proceedings of the 13th international conference on Supercomputing
Larrabee: a many-core x86 architecture for visual computing
ACM SIGGRAPH 2008 papers
Parallel matrix multiplication based on space-filling curves on shared memory multicore platforms
Proceedings of the 2008 workshop on Memory access on future processors: a solved problem?
Towards many-core implementation of LU decomposition using Peano Curves
Proceedings of the combined workshops on UnConventional high performance computing workshop plus memory access workshop
Cache oblivious matrix operations using Peano curves
PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
Hardware-oriented implementation of cache oblivious matrix operations based on space-filling curves
PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
Cache-obliviousness is an important but relatively new concept in cache optimization. Because cache-oblivious algorithms perform well on architectures with arbitrary cache configurations, the programming effort required to port and optimize codes for future architectures can be reduced significantly. In [8] and [9], fast parallel cache-oblivious linear algebra modules were presented. The underlying matrix storage schemes are based on space-filling curves. For matrix multiplication, all cache misses can be avoided, whereas for the LU decomposition algorithm the number of cache misses is minimized. It has been shown that the resulting codes work very well on several kinds of systems, ranging from laptops to supercomputers. In this paper, we show that the runtime characteristics of our existing cache-oblivious codes can be preserved on newer Intel processors. Special emphasis is put on the first many-core processor architecture with complete hardware-based cache coherency: the Larrabee architecture. As the latter is expected to be available as a PCIe card connected to the host system, porting had to take into account the transfer of data structures between different memory address spaces. Unfortunately, Larrabee was canceled as a graphics device for 2010, but Intel is expected to outline further steps for Larrabee during 2010.
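To illustrate what a matrix storage scheme based on a space-filling curve can look like, the following minimal sketch maps a position along a Peano curve to a (row, column) pair and uses it to linearize a 3^k x 3^k matrix. It is not the implementation from [8] or [9]: the function names `peano_coords` and `to_peano_layout`, the element-wise (rather than block-wise) granularity, and the choice of which coordinate is treated as the row are assumptions made for illustration only.

```python
def peano_coords(t, k):
    """Map index t (0 <= t < 9**k) to (row, col) on a 3**k x 3**k Peano curve.

    Uses Peano's classical digit construction: t is written with 2*k
    ternary digits; odd-position digits feed the row, even-position digits
    feed the column, and a digit d is mirrored (2 - d) whenever the sum of
    the digits seen so far for the *other* coordinate is odd.
    """
    digits = []
    for _ in range(2 * k):
        digits.append(t % 3)
        t //= 3
    digits.reverse()  # most significant digit first

    row = col = 0
    row_digit_sum = 0  # parity controls mirroring of the column digits
    col_digit_sum = 0  # parity controls mirroring of the row digits
    for j in range(k):
        d_row = digits[2 * j]
        d_col = digits[2 * j + 1]
        r = 2 - d_row if col_digit_sum % 2 else d_row
        row_digit_sum += d_row
        c = 2 - d_col if row_digit_sum % 2 else d_col
        col_digit_sum += d_col
        row = 3 * row + r
        col = 3 * col + c
    return row, col


def to_peano_layout(matrix, k):
    """Return the entries of a 3**k x 3**k matrix in Peano-curve order."""
    flat = []
    for t in range(9 ** k):
        r, c = peano_coords(t, k)
        flat.append(matrix[r][c])
    return flat


if __name__ == "__main__":
    k = 1
    n = 3 ** k
    a = [[r * n + c for c in range(n)] for r in range(n)]
    print(to_peano_layout(a, k))
```

Consecutive positions along the curve are always direct neighbours in the matrix, so entries that are close in memory are also close in the matrix at every scale. This locality is the property a cache-oblivious layout relies on, independent of any particular cache size.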