Memory requirements for balanced computer architectures
ISCA '86 Proceedings of the 13th annual international symposium on Computer architecture
Solving problems on concurrent processors. Vol. 1: General techniques and regular problems
Solving problems on concurrent processors. Vol. 1: General techniques and regular problems
FFTs in external or hierarchical memory
The Journal of Supercomputing
Computational frameworks for the fast Fourier transform
Computational frameworks for the fast Fourier transform
Parallel hierarchical N-body methods
Parallel hierarchical N-body methods
Volume rendering on scalable shared-memory MIMD architectures
VVS '92 Proceedings of the 1992 workshop on Volume visualization
Massively Parallel Linpack Benchmark on the Intel Touchstone Delta andIPSC/860 Systems (Progress Report)
An empirical comparison of the Kendall Square Research KSR-1 and Stanford DASH multiprocessors
Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Modeling communication in parallel algorithms: a fruitful interaction between theory and systems?
SPAA '94 Proceedings of the sixth annual ACM symposium on Parallel algorithms and architectures
Exploiting cache affinity in software cache coherence
ICS '94 Proceedings of the 8th international conference on Supercomputing
An approach to scalability study of shared memory parallel systems
SIGMETRICS '94 Proceedings of the 1994 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Exploring the design space for a shared-cache multiprocessor
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Impact of sharing-based thread placement on multithreaded architectures
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
The performance impact of flexibility in the Stanford FLASH multiprocessor
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Future applicability of bus-based shared memory multiprocessors
Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures
On characterizing bandwidth requirements of parallel applications
Proceedings of the 1995 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
CRL: high-performance all-software distributed shared memory
SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
I/O limitations in parallel molecular dynamics
Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
Techniques for reducing overheads of shared-memory multiprocessing
ICS '95 Proceedings of the 9th international conference on Supercomputing
Memory bandwidth limitations of future microprocessors
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Application and architectural bottlenecks in large scale distributed shared memory machines
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Designing Clustered Multiprocessor Systems under Packaging and Technological Advancements
IEEE Transactions on Parallel and Distributed Systems
Memory organization in multi-channel optical networks: NUMA and COMA revisited
ICS '96 Proceedings of the 10th international conference on Supercomputing
A performance study of cosmological simulations on message-passing and shared-memory multiprocessors
ICS '96 Proceedings of the 10th international conference on Supercomputing
Adaptive page replacement based on memory reference behavior
SIGMETRICS '97 Proceedings of the 1997 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Toward a More Realistic Performance Evaluation of Interconnection Networks
IEEE Transactions on Parallel and Distributed Systems
Performance implications of communication mechanisms in all-software global address space systems
PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Improving parallel shear-warp volume rendering on shared address space multiprocessors
PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
The design and analysis of a cache architecture for texture mapping
Proceedings of the 24th annual international symposium on Computer architecture
Reactive NUMA: a design for unifying S-COMA and CC-NUMA
Proceedings of the 24th annual international symposium on Computer architecture
A methodology and an evaluation of the SGI Origin2000
SIGMETRICS '98/PERFORMANCE '98 Proceedings of the 1998 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
An Application-Driven Study of Parallel System Overheads and Network Bandwidth Requirements
IEEE Transactions on Parallel and Distributed Systems
Ace: a language for parallel programming with customizable protocols
ACM Transactions on Computer Systems (TOCS)
IEEE Transactions on Computers
Architectural requirements and scalability of the NAS parallel benchmarks
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
An analytical model of the working-set sizes in decision-support systems
Proceedings of the 2000 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
WOSP '02 Proceedings of the 3rd international workshop on Software and performance
Improving Memory Performance for Indirect Accesses on SIMD Computers
IPPS '96 Proceedings of the 10th International Parallel Processing Symposium
Abstracting network characteristics and locality properties of parallel systems
HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
The impact of shared-cache clustering in small-scale shared-memory multiprocessors
HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
Configuration Independent Analysis for Characterizing Shared-Memory Applications
IPPS '98 Proceedings of the 12th. International Parallel Processing Symposium on International Parallel Processing Symposium
Latency, Occupancy, and Bandwidth in DSM Multiprocessors: A Performance Evaluation
IEEE Transactions on Computers
Dynamic tracking of page miss ratio curve for memory management
ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
The implications of working set analysis on supercomputing memory hierarchy design
Proceedings of the 19th annual international conference on Supercomputing
A page fault equation for modeling the effect of memory size
Performance Evaluation
IEEE Transactions on Computers
Program locality analysis using reuse distance
ACM Transactions on Programming Languages and Systems (TOPLAS)
Generalized ERSS tree model: Revisiting working sets
Performance Evaluation
TSV: A novel energy efficient Memory Integrity Verification scheme for embedded systems
Journal of Systems Architecture: the EUROMICRO Journal
Hi-index | 0.01 |
The distribution of resources among processors, memory and caches is a crucial question faced by designers of large-scale parallel machines. If a machine is to solve problems with a certain data set size, should it be built with a large number of processors each with a small amount of memory, or a smaller number of processors each with a large amount of memory? How much cache memory should be provided per processor for cost-effectiveness? And how do these decisions change as larger problems are run on larger machines?In this paper, we explore the above questions based on the characteristics of five important classes of large-scale parallel scientific applications. We first show that all the applications have a hierarchy of well-defined per-processor working sets, whose size, performance impact and scaling characteristics can help determine how large different levels of a multiprocessor's cache hierarchy should be. Then, we use these working sets together with certain other important characteristics of the applications—such as communication to computation ratios, concurrency, and load balancing behavior—to reflect upon the broader question of the granularity of processing nodes in high-performance multiprocessors.We find that very small caches whose sizes do not increase with the problem or machine size are adequate for all but two of the application classes. Even in the two exceptions, the working sets scale quite slowly with problem size, and the cache sizes needed for problems that will be run in the foreseeable future are small. We also find that relatively fine-grained machines, with large numbers of processors and quite small amounts of memory per processor, are appropriate for all the applications.