Computational power of pipelined memory hierarchies

Authors:
Gianfranco Bilardi; Kattamuri Ekanadham; Pratap Pattnaik
Affiliations:
Dip. Elettronica e Informatica, Università di Padova, Padova, Italy; T.J. Watson Research Center, IBM, Yorktown Heights, NY; T.J. Watson Research Center, IBM, Yorktown Heights, NY
Venue:
Proceedings of the thirteenth annual ACM symposium on Parallel algorithms and architectures
Year:
2001

References cited by this paper (10)

A model for hierarchical memory. STOC '87 Proceedings of the nineteenth annual ACM symposium on Theory of computing
Cache and memory hierarchy design: a performance-directed approach
Horizons of parallel computation. Journal of Parallel and Distributed Computing
Guest Editors' Introduction-Cache Memory and Related Problems: Enhancing and Exploiting the Locality. IEEE Transactions on Computers - Special issue on cache memory and related problems
Dynamic storage allocation in the Atlas computer, including an automatic use of a backing store. Communications of the ACM
Models of Computation: Exploring the Power of Computing
High Performance Compilers for Parallel Computing
Cache-Oblivious Algorithms. FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
Predicting Performance on SMPs. A Case Study: The SGI Power Challenge. IPDPS '00 Proceedings of the 14th International Symposium on Parallel and Distributed Processing
An Approach towards an Analytical Characterization of Locality and its Portability. IWIA '01 Proceedings of the Innovative Architecture for Future Generation High-Performance Processors and Systems (IWIA'01)

Works citing this paper (4)

Optimal organizations for pipelined hierarchical memories. Proceedings of the fourteenth annual ACM symposium on Parallel algorithms and architectures
A Characterization of Temporal Locality and Its Portability across Memory Hierarchies. ICALP '01 Proceedings of the 28th International Colloquium on Automata, Languages and Programming
An Address Dependence Model of Computation for Hierarchical Memories with Pipelined Transfer. IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 8 - Volume 09
On approximating the ideal random access machine by physical machines. Journal of the ACM (JACM)

Abstract

We define a model of computation, called the Pipelined Hierarchical Random Access Machine with access function a(x), denoted the a(x)-PH-RAM. In this model, a processor interacts with a memory that can accept requests at a constant rate and satisfy each request to location x within a(x) units of time.
We investigate memory management strategies that lead to time-efficient implementations of arbitrary computations on a PH-RAM. We begin by developing the so-called pipelined decomposition-tree memory management strategy, which can be tuned to the memory access function. Specifically, for a linear or sublinear access function a(x), we define the concept of latency-hiding depth d_a(x) and show how any computation of N operations can be implemented on an a(x)-PH-RAM in time T(N) = O(N d_a(N)). In particular, T(N) = O(N log N) if a(x) = O(x), T(N) = O(N log log N) if a(x) = O(x^β) with 0 < β < 1, and T(N) = O(N log* N) if a(x) = O(log x).
We develop lower bound techniques that make it possible to establish existential lower bounds on PH-RAMs. In particular, we exhibit computations for which T(N) = Ω(N log N / log log N) when a(x) = Ω(x), T(N) = Ω(N log log N) when a(x) = Ω(x^β) with 0 < β < 1, and T(N) = Ω(N log* N) when a(x) = Ω(log x).
The stated lower bounds show that the pipelined decomposition-tree strategy is existentially optimal for the latter case but indicate the potential for a modest, O(log log N) improvement for linear access functions. To realize this potential, a superpipelined decomposition-tree memory manager is proposed, which achieves T(N) = O(N log N / log log N).
The pipelined decomposition-tree strategy can also be tuned to the computation, in order to exploit its temporal locality as characterized by the width parameters [9]. When the latter are suitably bounded, T(N) = O(N) on any PH-RAM with linear or sublinear access function.
Finally, we discuss how performance could benefit from parallelism in the data-dependence dag of the computation or from architectural enhancements, such as block-transfer primitives, and formulate various questions that deserve further investigation.
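
To get a feel for how the upper bounds stated in the abstract compare, the following minimal Python sketch evaluates their growth expressions for a few values of N. It is only an illustration, not part of the paper: the function names (`log_star`, `upper_bound_shape`, `superpipelined_shape`) are hypothetical, the hidden constants are assumed to be 1, and base-2 logarithms are an arbitrary choice. It does not implement the decomposition-tree strategy itself.

```python
import math


def log_star(n: float) -> int:
    """Iterated logarithm (base 2): number of times log2 is applied
    before the value drops to at most 1."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count


def upper_bound_shape(n: int, access: str) -> float:
    """Growth expression of the pipelined decomposition-tree upper bound.

    Assumption: the constant hidden by the O-notation is set to 1.
    Valid for n >= 4 so that log log n is positive.
    """
    if access == "linear":        # a(x) = O(x):      T(N) = O(N log N)
        return n * math.log2(n)
    if access == "polynomial":    # a(x) = O(x^beta): T(N) = O(N log log N)
        return n * math.log2(math.log2(n))
    if access == "logarithmic":   # a(x) = O(log x):  T(N) = O(N log* N)
        return n * log_star(n)
    raise ValueError(f"unknown access function class: {access}")


def superpipelined_shape(n: int) -> float:
    """Growth expression of the superpipelined bound for linear access,
    T(N) = O(N log N / log log N), again with the constant set to 1."""
    return n * math.log2(n) / math.log2(math.log2(n))


if __name__ == "__main__":
    for n in (2**10, 2**16, 2**20):
        print(
            n,
            round(upper_bound_shape(n, "linear")),
            round(superpipelined_shape(n)),
            round(upper_bound_shape(n, "polynomial")),
            upper_bound_shape(n, "logarithmic"),
        )
```

Each printed row lists N followed by the linear-access bound, the superpipelined linear-access bound, the x^β bound, and the log x bound. Because the constants are assumed, only the relative growth of the columns is meaningful.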