This paper presents the generic program approach to achieving portable high performance. The approach has three phases. In the first, a generic program is written that defines a family of semantically equivalent program variants. In the second, the generic program is specialized to the variant that performs best on an abstract model of the target computer. In the third, this variant is translated to run on the target computer. The Parallel Memory Hierarchy (PMH) generic model is used to define the abstract models of target computers. The approach allows a spectrum of solutions. At one end, a simple generic program, roughly as difficult to write as a sequential program, can be tuned automatically to achieve reasonably good performance on a wide variety of computers. This solution can then be refined to give better performance. At the labor-intensive end, an application can be tuned so that it achieves the best possible performance on each of a collection of computers.
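The three phases can be illustrated with a minimal sketch. It is not the paper's system: the tiled matrix multiply, the `MemoryModel` class, and the cost formula in `estimated_cost` are all assumptions chosen for illustration, and the two-level model is only a crude stand-in for a PMH model. The tile size plays the role of the parameter that distinguishes the variants, and calling the selected variant stands in for the translation phase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryModel:
    """Hypothetical two-level machine model (a stand-in, not the PMH definition)."""
    fast_words: int       # capacity of the fast level, in matrix elements
    transfer_cost: float  # assumed cost of moving one element between levels
    flop_cost: float      # assumed cost of one multiply-add


def matmul_variant(a, b, n, tile):
    """Phase 1: a generic program. Every `tile` value gives a variant
    that computes the same n x n product."""
    c = [[0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = c[i][j]
                        for k in range(k0, min(k0 + tile, n)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c


def estimated_cost(n, tile, model):
    """Assumed cost of a variant on the abstract model.
    A variant is treated as infeasible if three tiles do not fit in fast memory."""
    if 3 * tile * tile > model.fast_words:
        return float("inf")
    blocks = -(-n // tile)  # ceiling division
    # Each (i0, j0, k0) step loads one tile of a and one of b; c is written once.
    traffic = 2 * blocks ** 3 * tile * tile + n * n
    return traffic * model.transfer_cost + n ** 3 * model.flop_cost


def specialize(n, candidates, model):
    """Phase 2: pick the variant with the lowest estimated cost on the model."""
    return min(candidates, key=lambda t: estimated_cost(n, t, model))


if __name__ == "__main__":
    n = 8
    a = [[i + j for j in range(n)] for i in range(n)]
    b = [[i * j + 1 for j in range(n)] for i in range(n)]
    small = MemoryModel(fast_words=48, transfer_cost=10.0, flop_cost=1.0)
    tile = specialize(n, [1, 2, 4, 8], small)
    # Phase 3 (stand-in): run the selected variant.
    c = matmul_variant(a, b, n, tile)
    print("chosen tile:", tile, "c[0][0] =", c[0][0])
```

Changing only the `MemoryModel` values changes which variant is selected, while the generic program itself stays untouched. That separation is the point of the sketch; a realistic model would describe more levels of the hierarchy as well as parallelism.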