Cache-Oblivious Algorithms

Authors:
Matteo Frigo; Charles E. Leiserson; Harald Prokop; Sridhar Ramachandran
Venue:
FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
Year:
1999

References: 26
Cited by: 250

Amortized efficiency of list update and paging rules

Communications of the ACM
A model for hierarchical memory

STOC '87 Proceedings of the nineteenth annual ACM symposium on Theory of computing
The input/output complexity of sorting and related problems

Communications of the ACM
Fast Fourier transforms: a tutorial review and a state of the art

Signal Processing
Introduction to algorithms

Introduction to algorithms
FFTs in external or hierarchical memory

The Journal of Supercomputing
The cache performance and optimizations of blocked algorithms

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Large-scale sorting in uniform memory hierarchies

Journal of Parallel and Distributed Computing - Special issue on parallel I/O systems
Deterministic distribution sort in shared and distributed memory multiprocessors

SPAA '93 Proceedings of the fifth annual ACM symposium on Parallel algorithms and architectures
Randomized algorithms

Randomized algorithms
An analysis of dag-consistent distributed shared-memory algorithms

Proceedings of the eighth annual ACM symposium on Parallel algorithms and architectures
Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code

PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Locality of Reference in LU Decomposition with Partial Pivoting

SIAM Journal on Matrix Analysis and Applications
Computer architecture (2nd ed.): a quantitative approach

Computer architecture (2nd ed.): a quantitative approach
A fast Fourier transform compiler

Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Nonlinear array layouts for hierarchical memory systems

ICS '99 Proceedings of the 13th international conference on Supercomputing
Recursive array layouts and fast parallel matrix multiplication

Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
The influence of caches on the performance of sorting

SODA '97 Proceedings of the eighth annual ACM-SIAM symposium on Discrete algorithms
External memory algorithms and data structures

External memory algorithms
The Design and Analysis of Computer Algorithms

The Design and Analysis of Computer Algorithms
Extending the Hong-Kung Model to Memory Hierarchies

COCOON '95 Proceedings of the First Annual International Conference on Computing and Combinatorics
Towards an Optimal Bit-Reversal Permutation Program

FOCS '98 Proceedings of the 39th Annual Symposium on Foundations of Computer Science
I/O complexity: The red-blue pebble game

STOC '81 Proceedings of the thirteenth annual ACM symposium on Theory of computing
Hierarchical memory with block transfer

SFCS '87 Proceedings of the 28th Annual Symposium on Foundations of Computer Science
Uniform memory hierarchies

SFCS '90 Proceedings of the 31st Annual Symposium on Foundations of Computer Science
A study of replacement algorithms for a virtual-storage computer

IBM Systems Journal

Cited by:

Towards a theory of cache-efficient algorithms

SODA '00 Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms
Computational power of pipelined memory hierarchies

Proceedings of the thirteenth annual ACM symposium on Parallel algorithms and architectures
Language support for Morton-order matrices

PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
External memory algorithms and data structures: dealing with massive data

ACM Computing Surveys (CSUR)
Cache-oblivious priority queue and graph algorithm applications

STOC '02 Proceedings of the thirty-fourth annual ACM symposium on Theory of computing
Computation regrouping: restructuring programs for temporal data cache locality

ICS '02 Proceedings of the 16th international conference on Supercomputing
A locality-preserving cache-oblivious dynamic dictionary

SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Cache oblivious search trees via binary trees of small height

SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Optimal organizations for pipelined hierarchical memories

Proceedings of the fourteenth annual ACM symposium on Parallel algorithms and architectures
High-Performance Algorithm Engineering for Computational Phylogenetics

The Journal of Supercomputing - Special issue on computational issues in fluid dynamics optimization and simulation
Towards a theory of cache-efficient algorithms

Journal of the ACM (JACM)
The set-associative cache performance of search trees

SODA '03 Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms
High-Performance Algorithm Engineering for Computational Phylogenetics

ICCS '01 Proceedings of the International Conference on Computational Science-Part II
Optimizing Graph Algorithms for Improved Cache Performance

IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
A Characterization of Temporal Locality and Its Portability across Memory Hierarchies

ICALP '01 Proceedings of the 28th International Colloquium on Automata, Languages and Programming
Exponential Structures for Efficient Cache-Oblivious Algorithms

ICALP '02 Proceedings of the 29th International Colloquium on Automata, Languages and Programming
Cache Oblivious Distribution Sweeping

ICALP '02 Proceedings of the 29th International Colloquium on Automata, Languages and Programming
Funnel Heap - A Cache Oblivious Priority Queue

ISAAC '02 Proceedings of the 13th International Symposium on Algorithms and Computation
Automatic Generation of Block-Recursive Codes

Euro-Par '00 Proceedings from the 6th International Euro-Par Conference on Parallel Processing
Fractal Matrix Multiplication: A Case Study on Portability of Cache Performance

WAE '01 Proceedings of the 5th International Workshop on Algorithm Engineering
Optimised Predecessor Data Structures for Internal Memory

WAE '01 Proceedings of the 5th International Workshop on Algorithm Engineering
From Algorithm to Program to Software Library

Informatics - 10 Years Back. 10 Years Ahead.
Analysing the Cache Behaviour of Non-uniform Distribution Sorting Algorithms

ESA '00 Proceedings of the 8th Annual European Symposium on Algorithms
Efficient Tree Layout in a Multilevel Memory Hierarchy

ESA '02 Proceedings of the 10th Annual European Symposium on Algorithms
Scanning and Traversing: Maintaining Data for Traversals in a Memory Hierarchy

ESA '02 Proceedings of the 10th Annual European Symposium on Algorithms
Cache-oblivious data structures for orthogonal range searching

Proceedings of the nineteenth annual symposium on Computational geometry
External memory data structures

Handbook of massive data sets
External memory algorithms

Handbook of massive data sets
On the limits of cache-obliviousness

Proceedings of the thirty-fifth annual ACM symposium on Theory of computing
QR factorization with Morton-ordered quadtree matrices for memory re-use and parallelism

Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming
Algorithm engineering for parallel computation

Experimental algorithmics
A comparison of cache aware and cache oblivious static search trees using program instrumentation

Experimental algorithmics
Efficient sorting using registers and caches

Journal of Experimental Algorithmics (JEA)
Adapting Radix Sort to the Memory Hierarchy

Journal of Experimental Algorithmics (JEA)
Proximity Mergesort: optimal in-place sorting in the cache-oblivious model

SODA '04 Proceedings of the fifteenth annual ACM-SIAM symposium on Discrete algorithms
Approximate minimum enclosing balls in high dimensions using core-sets

Journal of Experimental Algorithmics (JEA)
Effectively sharing a cache among threads

Proceedings of the sixteenth annual ACM symposium on Parallelism in algorithms and architectures
Cache-oblivious shortest paths in graphs using buffer heap

Proceedings of the sixteenth annual ACM symposium on Parallelism in algorithms and architectures
Restructuring computations for temporal data cache locality

International Journal of Parallel Programming
Implicit B-trees: a new data structure for the dictionary problem

Journal of Computer and System Sciences - Special issue on FOCS 2002
Optimizing Graph Algorithms for Improved Cache Performance

IEEE Transactions on Parallel and Distributed Systems
Interactive Exploration of Large Remote Micro-CT Scans

VIS '04 Proceedings of the conference on Visualization '04
A performance study of data layout techniques for improving data locality in refinement-based pathfinding

Journal of Experimental Algorithmics (JEA)
A locality-preserving cache-oblivious dynamic dictionary

Journal of Algorithms
Cache-oblivious planar orthogonal range searching and counting

SCG '05 Proceedings of the twenty-first annual symposium on Computational geometry
Cache-oblivious R-trees

SCG '05 Proceedings of the twenty-first annual symposium on Computational geometry
External-memory exact and approximate all-pairs shortest-paths in undirected graphs

SODA '05 Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms
Cache-oblivious mesh layouts

ACM SIGGRAPH 2005 Papers
Concurrent cache-oblivious B-trees

Proceedings of the seventeenth annual ACM symposium on Parallelism in algorithms and architectures
Statistical Models for Empirical Search-Based Performance Tuning

International Journal of High Performance Computing Applications
Think globally, search locally

Proceedings of the 19th annual international conference on Supercomputing
Cache oblivious stencil computations

Proceedings of the 19th annual international conference on Supercomputing
Implicit dictionaries with O(1) modifications per update and fast search

SODA '06 Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithms
Cache-oblivious string dictionaries

SODA '06 Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithms
Cache-oblivious dynamic programming

SODA '06 Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithms
A computational study of external-memory BFS algorithms

SODA '06 Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithms
A hierarchical model of data locality

Conference record of the 33rd ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Improving locality with parallel hierarchical copying GC

Proceedings of the 5th international symposium on Memory management
Simple and semi-dynamic structures for cache-oblivious planar orthogonal range searching

Proceedings of the twenty-second annual symposium on Computational geometry
Architecture-conscious hashing

DaMoN '06 Proceedings of the 2nd international workshop on Data management on new hardware
Cache-oblivious string B-trees

Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
The cache-oblivious Gaussian elimination paradigm: theoretical framework and experimental evaluation

Proceedings of the eighteenth annual ACM symposium on Parallelism in algorithms and architectures
The cache complexity of multithreaded cache oblivious algorithms

Proceedings of the eighteenth annual ACM symposium on Parallelism in algorithms and architectures
Fast additions on masked integers

ACM SIGPLAN Notices
Analyzing block locality in Morton-order and Morton-hybrid matrices

MEDEA '06 Proceedings of the 2006 workshop on MEmory performance: DEaling with Applications, systems and architectures
Cache-conscious frequent pattern mining on modern and emerging processors

The VLDB Journal — The International Journal on Very Large Data Bases
Seven at one stroke: results from a cache-oblivious paradigm for scalable matrix algorithms

Proceedings of the 2006 workshop on Memory system performance and correctness
Cache-oblivious nested-loop joins

CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
A data structure for a sequence of string accesses in external memory

ACM Transactions on Algorithms (TALG)
Locality and parallelism optimization for dynamic programming algorithm in bioinformatics

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Sequoia: programming the memory hierarchy

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
A memory model for scientific algorithms on graphics processors

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Cache-Friendly implementations of transitive closure

Journal of Experimental Algorithmics (JEA)
Linear work suffix array construction

Journal of the ACM (JACM)
Engineering a cache-oblivious sorting algorithm

Journal of Experimental Algorithmics (JEA)
Compilation for explicitly managed memory hierarchies

Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming
The memory behavior of cache oblivious stencil computations

The Journal of Supercomputing
Cache oblivious algorithms for nonserial polyadic programming

The Journal of Supercomputing
Models for parallel and hierarchical computation

Proceedings of the 4th international conference on Computing frontiers
On the importance of cache tuning in a cache-aware algorithm: A case study

Computers & Mathematics with Applications
Balanced allocation and dictionaries with tightly packed constant size bins

Theoretical Computer Science
Optimal sparse matrix dense vector multiplication in the I/O-model

Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
The cache-oblivious Gaussian elimination paradigm: theoretical framework, parallelization and experimental evaluation

Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Cache-oblivious streaming B-trees

Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
An experimental comparison of cache-oblivious and cache-conscious programs

Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
HAT-trie: a cache-conscious trie-based data structure for strings

ACSC '07 Proceedings of the thirtieth Australasian conference on Computer science - Volume 62
Representation-transparent matrix algorithms with scalable performance

Proceedings of the 21st annual international conference on Supercomputing
Adaptive Strassen's matrix multiplication

Proceedings of the 21st annual international conference on Supercomputing
Skewed binary search trees

ESA'06 Proceedings of the 14th conference on Annual European Symposium - Volume 14
Multithreaded programming in Cilk

Proceedings of the 2007 international workshop on Parallel symbolic computation
Making deterministic signatures quickly

SODA '07 Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms
A faster cache-oblivious shortest-path algorithm for undirected graphs with bounded edge lengths

SODA '07 Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms
Analyzing block locality in Morton-order and Morton-hybrid matrices

ACM SIGARCH Computer Architecture News
Alternation and redundancy analysis of the intersection problem

ACM Transactions on Algorithms (TALG)
Programming with tiles

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Provably good multicore cache performance for divide-and-conquer algorithms

Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms
Efficient gather and scatter operations on graphics processors

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
A general framework for improving query processing performance on multi-level memory hierarchies

DaMoN '07 Proceedings of the 3rd international workshop on Data management on new hardware
Cache-oblivious databases: Limitations and opportunities

ACM Transactions on Database Systems (TODS)
An experimental study of sorting and branch prediction

Journal of Experimental Algorithmics (JEA)
Terracost: Computing least-cost-path surfaces for massive grid terrains

Journal of Experimental Algorithmics (JEA)
On searching compressed string collections cache-obliviously

Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Cache-efficient dynamic programming algorithms for multicores

Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures
Combating I-O bottleneck using prefetching: model, algorithms, and ramifications

The Journal of Supercomputing
On the limits of cache-oblivious rational permutations

Theoretical Computer Science
Algorithms and data structures for external memory

Foundations and Trends® in Theoretical Computer Science
Massive model visualization techniques: course notes

ACM SIGGRAPH 2008 classes
Computing visibility on terrains in external memory

Journal of Experimental Algorithmics (JEA)
I/O Efficient Dynamic Data Structures for Longest Prefix Queries

SWAT '08 Proceedings of the 11th Scandinavian workshop on Algorithm Theory
A Bridging Model for Multi-core Computing

ESA '08 Proceedings of the 16th annual European symposium on Algorithms
Cache-Oblivious Red-Blue Line Segment Intersection

ESA '08 Proceedings of the 16th annual European symposium on Algorithms
Performance comparison of different parallel lattice Boltzmann implementations on multi-core multi-socket systems

International Journal of Computational Science and Engineering
Cache-oblivious selection in sorted X+Y matrices

Information Processing Letters
B-tries for disk-based string management

The VLDB Journal — The International Journal on Very Large Data Bases
Making deterministic signatures quickly

ACM Transactions on Algorithms (TALG)
Optimal in-place algorithms for 3-D convex hulls and 2-D segment intersection

Proceedings of the twenty-fifth annual symposium on Computational geometry
Cache-oblivious range reporting with optimal queries requires superlinear space

Proceedings of the twenty-fifth annual symposium on Computational geometry
A general approach for cache-oblivious range reporting and approximate range counting

Proceedings of the twenty-fifth annual symposium on Computational geometry
Harnessing parallelism in multicore clusters with the all-pairs and wavefront abstractions

Proceedings of the 18th ACM international symposium on High performance distributed computing
On approximating the ideal random access machine by physical machines

Journal of the ACM (JACM)
Assembling approximately optimal binary search trees efficiently using arithmetics

Information Processing Letters
Lists revisited: Cache-conscious STL lists

Journal of Experimental Algorithmics (JEA)
Design and Engineering of External Memory Traversal Algorithms for General Graphs

Algorithmics of Large and Complex Networks
On Cartesian Trees and Range Minimum Queries

ICALP '09 Proceedings of the 36th International Colloquium on Automata, Languages and Programming: Part I
Brief announcement: low depth cache-oblivious sorting

Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures
Communication-optimal parallel and sequential Cholesky decomposition: extended abstract

Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures
Masking patterns in sequences: A new class of motif discovery with don't cares

Theoretical Computer Science
Relational query coprocessing on graphics processors

ACM Transactions on Database Systems (TODS)
Multiprocessor System-on-Chip designs with active memory processors for higher memory efficiency

Proceedings of the 46th Annual Design Automation Conference
Cache-optimal algorithms for option pricing

ACM Transactions on Mathematical Software (TOMS)
Sorting on architecturally diverse computer systems

Proceedings of the Third International Workshop on High-Performance Reconfigurable Computing Technology and Applications
Improved visibility computation on massive grid terrains

Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems
Evaluating multicore algorithms on the unified memory model

Scientific Programming - Software Development for Multi-core Computing Systems
Engineering burstsort: Toward fast in-place string sorting

Journal of Experimental Algorithmics (JEA)
Randomized multi-pass streaming skyline algorithms

Proceedings of the VLDB Endowment
Star-quadtrees and guard-quadtrees: I/O-efficient indexes for fat triangulations and low-density planar subdivisions

Computational Geometry: Theory and Applications
Algorithms for memory hierarchies: advanced lectures

Algorithms for memory hierarchies: advanced lectures
Bureaucratic protocols for secure two-party sorting, selection, and permuting

ASIACCS '10 Proceedings of the 5th ACM Symposium on Information, Computer and Communications Security
Optimal cache-oblivious implicit dictionaries

ICALP'03 Proceedings of the 30th international conference on Automata, languages and programming
Simple linear work suffix array construction

ICALP'03 Proceedings of the 30th international conference on Automata, languages and programming
The worst page-replacement policy

FUN'07 Proceedings of the 4th international conference on Fun with algorithms
Is cache-oblivious DGEMM viable?

PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
On the limits of cache-oblivious matrix transposition

TGC'06 Proceedings of the 2nd international conference on Trustworthy global computing
I/O-efficient map overlay and point location in low-density subdivisions

ISAAC'07 Proceedings of the 18th international conference on Algorithms and computation
Engineering burstsort: towards fast in-place string sorting

WEA'08 Proceedings of the 7th international conference on Experimental algorithms
Characterizing the performance of flash memory storage devices and its impact on algorithm design

WEA'08 Proceedings of the 7th international conference on Experimental algorithms
I/O-efficient point location in a set of rectangles

LATIN'08 Proceedings of the 8th Latin American conference on Theoretical informatics
Cache-oblivious ray reordering

ACM Transactions on Graphics (TOG)
Cache-oblivious hashing

Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Pruning spanners and constructing well-separated pair decompositions in the presence of memory hierarchies

Journal of Discrete Algorithms
Static reuse distances for locality-based optimizations in MATLAB

Proceedings of the 24th ACM International Conference on Supercomputing
Basic network creation games

Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
The Cilkview scalability analyzer

Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
Low depth cache-oblivious algorithms

Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
Optimal in-place and cache-oblivious algorithms for 3-d convex hulls and 2-d segment intersection

Computational Geometry: Theory and Applications
A general approach for cache-oblivious range reporting and approximate range counting

Computational Geometry: Theory and Applications
Balanced dense polynomial multiplication on multi-cores

ACM Communications in Computer Algebra
Parallel computation of the minimal elements of a poset

Proceedings of the 4th International Workshop on Parallel and Symbolic Computation
Cache-oblivious polygon indecomposability testing

Proceedings of the 4th International Workshop on Parallel and Symbolic Computation
Cache friendly sparse matrix-vector multiplication

Proceedings of the 4th International Workshop on Parallel and Symbolic Computation
Harnessing parallelism in multicore clusters with the All-Pairs, Wavefront, and Makeflow abstractions

Cluster Computing
Cache-Oblivious Dynamic Programming for Bioinformatics

IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Fast field solver for the simulation of large-area OLEDs

Microelectronics Journal
Additive spanners and (α, β)-spanners

ACM Transactions on Algorithms (TALG)
Engineering scalable, cache and space efficient tries for strings

The VLDB Journal — The International Journal on Very Large Data Bases
Cache-oblivious dynamic dictionaries with update/query tradeoffs

SODA '10 Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms
Compression, indexing, and retrieval for massive string data

CPM'10 Proceedings of the 21st annual conference on Combinatorial pattern matching
ISB-tree: A new indexing scheme with efficient expected behaviour

Journal of Discrete Algorithms
Resource oblivious sorting on multicores

ICALP'10 Proceedings of the 37th international colloquium conference on Automata, languages and programming
Hierarchical Diagonal Blocking and Precision Reduction Applied to Combinatorial Multigrid

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Fast prefix search in little space, with applications

ESA'10 Proceedings of the 18th annual European conference on Algorithms: Part I
A bridging model for multi-core computing

Journal of Computer and System Sciences
Cache-oblivious simulation of parallel programs

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Improving locality of nonserial polyadic dynamic programming

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Algorithm engineering: bridging the gap between algorithm theory and practice

Algorithm engineering: bridging the gap between algorithm theory and practice
Redesigning the string hash table, burst trie, and BST to exploit cache

Journal of Experimental Algorithmics (JEA)
Dynamic z-fast tries

SPIRE'10 Proceedings of the 17th international conference on String processing and information retrieval
Cache complexity and multicore implementation for univariate real root isolation

ACM Communications in Computer Algebra
Cache friendly sparse matrix-vector multiplication

ACM Communications in Computer Algebra
Patterns for cache optimizations on multi-processor machines

Proceedings of the 2010 Workshop on Parallel Programming Patterns
Three layer cake for shared-memory programming

Proceedings of the 2010 Workshop on Parallel Programming Patterns
autopin: automated optimization of thread-to-core pinning on multicore systems

Transactions on high-performance embedded architectures and compilers III
Cache-oblivious index for approximate string matching

Theoretical Computer Science
Graph expansion and communication costs of fast matrix multiplication: regular submission

Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures
The pochoir stencil compiler

Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures
Scheduling irregular parallel computations on hierarchical caches

Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures
Data-oblivious external-memory algorithms for the compaction, selection, and sorting of outsourced data

Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures
Balance principles for algorithm-architecture co-design

HotPar'11 Proceedings of the 3rd USENIX conference on Hot topics in parallelism
Stratified B-trees and versioned dictionaries

HotStorage'11 Proceedings of the 3rd USENIX conference on Hot topics in storage and file systems
Performance analysis and optimization of molecular dynamics simulation on Godson-T many-core processor

Proceedings of the 8th ACM International Conference on Computing Frontiers
On the weak prefix-search problem

CPM'11 Proceedings of the 22nd annual conference on Combinatorial pattern matching
Cache-Oblivious Algorithms

ACM Transactions on Algorithms (TALG)
Two-dimensional cache-oblivious sparse matrix-vector multiplication

Parallel Computing
Communication-optimal Parallel and Sequential Cholesky Decomposition

SIAM Journal on Scientific Computing
Algorithmic ramifications of prefetching in memory hierarchy

HiPC'06 Proceedings of the 13th international conference on High Performance Computing
Optimizing matrix multiplication with a classifier learning system

LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
Cache-Optimal data-structures for hierarchical methods on adaptively refined space-partitioning grids

HPCC'06 Proceedings of the Second international conference on High Performance Computing and Communications
Lists revisited: cache conscious STL lists

WEA'06 Proceedings of the 5th international conference on Experimental Algorithms
I/O-efficient data structures for colored range and prefix reporting

Proceedings of the twenty-third annual ACM-SIAM symposium on Discrete Algorithms
External string sorting: faster and cache-oblivious

STACS'06 Proceedings of the 23rd Annual conference on Theoretical Aspects of Computer Science
A cache oblivious algorithm for matrix multiplication based on Peano's space filling curve

PPAM'05 Proceedings of the 6th international conference on Parallel Processing and Applied Mathematics
Multi-DaC programming model: a variant of multi-BSP model for divide-and-conquer algorithms

DAMP '12 Proceedings of the 7th workshop on Declarative aspects and applications of multicore programming
Cache-oblivious planar shortest paths

ICALP'05 Proceedings of the 32nd international conference on Automata, Languages and Programming
Cache-aware and cache-oblivious adaptive sorting

ICALP'05 Proceedings of the 32nd international conference on Automata, Languages and Programming
Cache-oblivious comparison-based algorithms on multisets

ESA'05 Proceedings of the 13th annual European conference on Algorithms
Out-of-Core Computations of High-Resolution Level Sets by Means of Code Transformation

Journal of Scientific Computing
FFT-based dense polynomial arithmetic on multi-cores

HPCS'09 Proceedings of the 23rd international conference on High Performance Computing Systems and Applications
ISB-tree: a new indexing scheme with efficient expected behaviour

ISAAC'05 Proceedings of the 16th international conference on Algorithms and Computation
Improved space bounds for cache-oblivious range reporting

Proceedings of the twenty-second annual ACM-SIAM symposium on Discrete Algorithms
JuliusC: a practical approach for the analysis of divide-and-conquer algorithms

LCPC'04 Proceedings of the 17th international conference on Languages and Compilers for High Performance Computing
A paradigm for parallel matrix algorithms: scalable Cholesky

Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing
A family of high-performance matrix multiplication algorithms

PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
A cache-aware algorithm for PDEs on hierarchical data structures

PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
TL-DAE: thread-level decoupled access/execution for OpenMP on the Cyclops-64 many-core processor

LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
Fast and cache-oblivious dynamic programming with local dependencies

LATA'12 Proceedings of the 6th international conference on Language and Automata Theory and Applications
Revisiting the cache miss analysis of multithreaded algorithms

LATIN'12 Proceedings of the 10th Latin American international conference on Theoretical Informatics
On the communication complexity of 3D FFTs and its implications for Exascale

Proceedings of the 26th ACM international conference on Supercomputing
GPU Performance Enhancement via Communication Cost Reduction: Case Studies of Radix Sort and WSN Relay Node Placement Problem

CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012)
Brief announcement: efficient cache oblivious algorithms for randomized divide-and-conquer on the multicore model

Proceedings of the twenty-fourth annual ACM symposium on Parallelism in algorithms and architectures
Parallel and I/O efficient set covering algorithms

Proceedings of the twenty-fourth annual ACM symposium on Parallelism in algorithms and architectures
Sorting on GPUs for large scale datasets: A thorough comparison

Information Processing and Management: an International Journal
CloudIQ: a framework for processing base stations in a data center

Proceedings of the 18th annual international conference on Mobile computing and networking
In-place heap construction with optimized comparisons, moves, and cache misses

MFCS'12 Proceedings of the 37th international conference on Mathematical Foundations of Computer Science
Optimizing matrix transposes using a POWER7 cache model and explicit prefetching

ACM SIGMETRICS Performance Evaluation Review
How to achieve scalable fork/join on many-core architectures?

Proceedings of the 3rd annual conference on Systems, programming, and applications: software for humanity
Cache-efficient parallel isosurface extraction for shared cache multicores

EG PGV'10 Proceedings of the 10th Eurographics conference on Parallel Graphics and Visualization
Parallel construction of k-nearest neighbor graphs for point clouds

SPBG'08 Proceedings of the Fifth Eurographics / IEEE VGTC conference on Point-Based Graphics
Aspen: a domain specific language for performance modeling

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Communication-avoiding parallel Strassen: implementation and performance

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Cache-oblivious index for approximate string matching

CPM'07 Proceedings of the 18th annual conference on Combinatorial Pattern Matching
Graph expansion and communication costs of fast matrix multiplication

Journal of the ACM (JACM)
Avoiding communication through a multilevel LU factorization

Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
An energy complexity model for algorithms

Proceedings of the 4th conference on Innovations in Theoretical Computer Science
Cache and I/O efficient functional algorithms

POPL '13 Proceedings of the 40th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Cache-Oblivious dictionaries and multimaps with negligible failure probability

MedAlg'12 Proceedings of the First Mediterranean conference on Design and Analysis of Algorithms
The Power 775 architecture at scale

Proceedings of the 27th international ACM conference on International conference on supercomputing
On the weak prefix-search problem

Theoretical Computer Science
Locality-aware task management for unstructured parallelism: a quantitative limit study

Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures
Communication efficient Gaussian elimination with partial pivoting using a shape morphing data layout

Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures
Program-centric cost models for locality

Proceedings of the ACM SIGPLAN Workshop on Memory Systems Performance and Correctness
Parallel triangle counting in massive streaming graphs

Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
Modeling synthetic aperture radar computation with Aspen

International Journal of High Performance Computing Applications
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles

ACM SIGOPS 24th Symposium on Operating Systems Principles
X-Stream: edge-centric graph processing using streaming partitions

Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
A generic high-performance method for deinterleaving scientific data

Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing
Efficient parallel and external matching

Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing
Scalability study of molecular dynamics simulation on Godson-T many-core architecture

Journal of Parallel and Distributed Computing
A memory access model for highly-threaded many-core architectures

Future Generation Computer Systems
Measurement of the latency parameters of the Multi-BSP model: a multicore benchmarking approach

The Journal of Supercomputing

Abstract

This paper presents asymptotically optimal algorithms for rectangular matrix transpose, FFT, and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are cache oblivious: no variables dependent on hardware parameters, such as cache size and cache-line length, need to be tuned to achieve optimality. Nevertheless, these algorithms use an optimal amount of work and move data optimally among multiple levels of cache. For a cache with size $Z$ and cache-line length $L$, where $Z = \Omega(L^2)$, the number of cache misses for an $m \times n$ matrix transpose is $\Theta(1 + mn/L)$. The number of cache misses for either an $n$-point FFT or the sorting of $n$ numbers is $\Theta(1 + (n/L)(1 + \log_Z n))$. We also give a $\Theta(mnp)$-work algorithm to multiply an $m \times n$ matrix by an $n \times p$ matrix that incurs $\Theta(1 + (mn + np + mp)/L + mnp/(L\sqrt{Z}))$ cache faults.

We introduce an "ideal-cache" model to analyze our algorithms. We prove that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multiple levels and that the assumption of optimal replacement in the ideal-cache model can be simulated efficiently by LRU replacement. We also provide preliminary empirical results on the effectiveness of cache-oblivious algorithms in practice.
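To make the idea of "no cache-dependent parameters" concrete, here is a minimal sketch of a recursive divide-and-conquer transpose of a rectangular $m \times n$ matrix. It assumes a strategy of repeatedly halving the larger dimension of the current block until the block is small, then copying it directly. The function names `co_transpose` and `transpose_block`, the flat row-major list layout, and the `cutoff` constant are illustrative assumptions, not the authors' code. The cutoff is a fixed constant that only reduces Python call overhead; it is never derived from $Z$ or $L$.

```python
def transpose_block(a, b, m, n, r0, r1, c0, c1, cutoff):
    """Write the transpose of block A[r0:r1, c0:c1] into B[c0:c1, r0:r1].

    A is m x n and B is n x m, both stored row-major in flat lists.
    """
    rows, cols = r1 - r0, c1 - c0
    if rows <= cutoff and cols <= cutoff:
        # Base case: the block is small enough to copy element by element.
        for i in range(r0, r1):
            for j in range(c0, c1):
                b[j * m + i] = a[i * n + j]
    elif rows >= cols:
        # Split the taller dimension: top half, then bottom half.
        mid = r0 + rows // 2
        transpose_block(a, b, m, n, r0, mid, c0, c1, cutoff)
        transpose_block(a, b, m, n, mid, r1, c0, c1, cutoff)
    else:
        # Split the wider dimension: left half, then right half.
        mid = c0 + cols // 2
        transpose_block(a, b, m, n, r0, r1, c0, mid, cutoff)
        transpose_block(a, b, m, n, r0, r1, mid, c1, cutoff)


def co_transpose(a, m, n, cutoff=8):
    """Return the n x m transpose of the m x n row-major matrix `a`.

    `cutoff` (hypothetical, illustrative) is a fixed constant that only
    trims interpreter overhead; it does not depend on any cache size or
    cache-line length.
    """
    if len(a) != m * n:
        raise ValueError("a must contain exactly m * n elements")
    if cutoff < 1:
        raise ValueError("cutoff must be at least 1")
    b = [None] * (m * n)
    if m > 0 and n > 0:
        transpose_block(a, b, m, n, 0, m, 0, n, cutoff)
    return b


if __name__ == "__main__":
    # The 2 x 3 matrix [[0, 1, 2], [3, 4, 5]] stored row-major.
    print(co_transpose([0, 1, 2, 3, 4, 5], 2, 3))
```

For example, `co_transpose([0, 1, 2, 3, 4, 5], 2, 3)` returns `[0, 3, 1, 4, 2, 5]`, the row-major form of the $3 \times 2$ transpose. The design point is that, at some recursion depth, every block fits in whatever cache is present, whatever its size and line length, so no parameter has to be tuned. Python itself will not exhibit these cache effects; the sketch only illustrates the control structure.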