Quantitative system performance: computer system analysis using queueing network models
Memory requirements for balanced computer architectures
ISCA '86 Proceedings of the 13th annual international symposium on Computer architecture
The input/output complexity of sorting and related problems
Communications of the ACM
Estimating interlock and improving balance for pipelined architectures
Journal of Parallel and Distributed Computing
A bridging model for parallel computation
Communications of the ACM
LogP: towards a realistic model of parallel computation
PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming
Cilk: an efficient multithreaded runtime system
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Modeling the benefits of mixed data and task parallelism
Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures
Programming parallel algorithms
Communications of the ACM
The Parallel Evaluation of General Arithmetic Expressions
Journal of the ACM (JACM)
The data locality of work stealing
Proceedings of the twelfth annual ACM symposium on Parallel algorithms and architectures
Towards an energy complexity of computation
Information Processing Letters - Special issue in honor of Edsger W. Dijkstra
FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
I/O complexity: The red-blue pebble game
STOC '81 Proceedings of the thirteenth annual ACM symposium on Theory of computing
Communications of the ACM - Voting systems
Communication lower bounds for distributed-memory matrix multiplication
Journal of Parallel and Distributed Computing
An experimental comparison of cache-oblivious and cache-conscious programs
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Scheduling threads for constructive cache sharing on CMPs
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
A metric space for computer programs and the principle of computational least action
The Journal of Supercomputing
3D-Stacked Memory Architectures for Multi-core Processors
ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
A Bridging Model for Multi-core Computing
ESA '08 Proceedings of the 16th annual European symposium on Algorithms
Amdahl's Law in the Multicore Era
Computer
Validity of the single processor approach to achieving large scale computing capabilities
AFIPS '67 (Spring) Proceedings of the April 18-20, 1967, spring joint computer conference
Roofline: an insightful visual performance model for multicore architectures
Communications of the ACM - A Direct Path to Dependable Software
An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness
Proceedings of the 36th annual international symposium on Computer architecture
Analysis of Parallel Algorithms for Energy Conservation in Scalable Multicore Architectures
ICPP '09 Proceedings of the 2009 International Conference on Parallel Processing
Model-driven autotuning of sparse matrix-vector multiply on GPUs
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Low depth cache-oblivious algorithms
Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
A quantitative performance analysis model for GPU architectures
HPCA '11 Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture
On the communication complexity of 3D FFTs and its implications for Exascale
Proceedings of the 26th ACM international conference on Supercomputing
How much (execution) time and energy does my algorithm cost?
XRDS: Crossroads, The ACM Magazine for Students - Scientific Computing
We consider the problem of "co-design," by which we mean how to design computational algorithms for particular hardware architectures and vice versa. Our position is that balance principles should drive the co-design process. A balance principle is a theoretical constraint equation that explicitly relates algorithm parameters to hardware parameters according to some figure of merit, such as speed, power, or cost. This notion originates in the work of Kung (1986); Callahan, Cocke, and Kennedy (1988); and McCalpin (1995); however, we reinterpret these classical notions of balance in a modern context of parallel and I/O-efficient algorithm design, as well as trends in emerging architectures. From such a principle, we argue that one can better understand algorithm and hardware trends, and furthermore gain insight into how to improve both algorithms and hardware. For example, we suggest that although matrix multiply is currently compute-bound, it will become memory-bound in as few as ten years, even if last-level caches continue to grow at their current rates. Our overall aim is to suggest how to co-design rigorously and quantitatively while still yielding intuition and insight.
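For concreteness, the following is a minimal sketch (not necessarily the paper's exact formulation) of what such a constraint equation can look like for dense matrix multiply, using the classical Hong-Kung I/O lower bound. Here $n$ is the matrix dimension, $R_{\text{peak}}$ the peak arithmetic rate (flop/s), $\beta$ the main-memory bandwidth (words/s), and $Z$ the last-level cache capacity (words); this notation is assumed for illustration only.

\[
T_{\text{comp}} \approx \frac{2n^3}{R_{\text{peak}}},
\qquad
T_{\text{mem}} \gtrsim \frac{1}{\beta}\cdot\Theta\!\left(\frac{n^3}{\sqrt{Z}}\right),
\]
\[
\text{compute-bound} \;\iff\; T_{\text{comp}} \ge T_{\text{mem}} \;\iff\; \sqrt{Z} \gtrsim \frac{R_{\text{peak}}}{\beta} \quad \text{(up to constant factors)}.
\]

In words, keeping matrix multiply compute-bound requires the cache capacity $Z$ to grow roughly as the square of the machine's flop-to-bandwidth ratio; if $R_{\text{peak}}/\beta$ grows faster than $\sqrt{Z}$, the kernel eventually becomes memory-bound, which is the kind of trend-driven conclusion a balance principle makes precise.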