In this paper, the red-blue pebble game is proposed to model the input-output complexity of algorithms. Using the pebble game formulation, a number of lower bound results for the I/O requirement are proven. For example, it is shown that to perform the n-point FFT or the ordinary n×n matrix multiplication algorithm with O(S) memory, at least Ω(n log n / log S) or Ω(n³/√S) time, respectively, is needed for the I/O. Similar results are obtained for algorithms for several other problems. All of the lower bounds presented are the best possible in the sense that they are achievable by certain decomposition schemes. Results of this paper may provide insight into the difficult task of balancing I/O and computation in special-purpose system designs. For example, for the n-point FFT, the lower bound on I/O time implies that an S-point device achieving a speed-up ratio of order log S over the conventional O(n log n) time implementation is all one can hope for.
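The two bounds quoted in the abstract can be made concrete with a small numerical sketch. The code below is illustrative only and is not taken from the paper: the function names are hypothetical, constant factors are dropped, logarithms are assumed to be base 2, and the tiled schedule is the standard square-blocked matrix multiplication, used here as one plausible example of a decomposition scheme whose I/O grows like n³/√S. It assumes fast memory holds S words and that the tile side divides n.

```python
import math


def fft_io_bound_shape(n: int, S: int) -> float:
    """Growth term n log n / log S for the n-point FFT (constants dropped, base-2 logs assumed)."""
    return n * math.log2(n) / math.log2(S)


def matmul_io_bound_shape(n: int, S: int) -> float:
    """Growth term n^3 / sqrt(S) for ordinary n x n matrix multiplication (constants dropped)."""
    return n ** 3 / math.sqrt(S)


def tiled_matmul_io(n: int, S: int) -> int:
    """Count words moved by a square-tiled schedule (an assumed scheme, not the paper's).

    Three b x b tiles (one each of A, B, C) must fit in fast memory, so b = floor(sqrt(S / 3)).
    """
    b = math.isqrt(S // 3)
    if b < 1 or n % b != 0:
        raise ValueError("sketch assumes S >= 3 and that the tile side divides n")
    tiles = n // b
    loads = tiles ** 3 * 2 * b * b   # one A tile and one B tile per (i, j, k) tile triple
    stores = tiles ** 2 * b * b      # each C tile is written back once
    return loads + stores            # equals 2 * n^3 / b + n^2


if __name__ == "__main__":
    n, S = 64, 48
    print("tiled matmul I/O:", tiled_matmul_io(n, S))
    print("n^3 / sqrt(S):   ", matmul_io_bound_shape(n, S))

    n, S = 1024, 32
    compute = n * math.log2(n)       # conventional O(n log n) operation count
    print("FFT speed-up ceiling:", compute / fft_io_bound_shape(n, S))  # equals log2(S)
```

With b proportional to √S, the tiled count 2n³/b + n² stays within a constant factor of n³/√S, which is the sense in which such a bound is achievable. The last line mirrors the abstract's closing remark: dividing the n log n operation count by the I/O term leaves a factor of log S.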