Working sets, cache sizes, and node granularity issues for large-scale multiprocessors

Authors:
Edward Rothberg; Jaswinder Pal Singh; Anoop Gupta
Venue:
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Year:
1993

Abstract

The distribution of resources among processors, memory and caches is a crucial question faced by designers of large-scale parallel machines. If a machine is to solve problems with a certain data set size, should it be built with a large number of processors each with a small amount of memory, or a smaller number of processors each with a large amount of memory? How much cache memory should be provided per processor for cost-effectiveness? And how do these decisions change as larger problems are run on larger machines?

In this paper, we explore the above questions based on the characteristics of five important classes of large-scale parallel scientific applications. We first show that all the applications have a hierarchy of well-defined per-processor working sets, whose size, performance impact and scaling characteristics can help determine how large different levels of a multiprocessor's cache hierarchy should be. Then, we use these working sets together with certain other important characteristics of the applications (such as communication to computation ratios, concurrency, and load balancing behavior) to reflect upon the broader question of the granularity of processing nodes in high-performance multiprocessors.

We find that very small caches whose sizes do not increase with the problem or machine size are adequate for all but two of the application classes. Even in the two exceptions, the working sets scale quite slowly with problem size, and the cache sizes needed for problems that will be run in the foreseeable future are small. We also find that relatively fine-grained machines, with large numbers of processors and quite small amounts of memory per processor, are appropriate for all the applications.