The Impact of Exploiting Instruction-Level Parallelism on Shared-Memory Multiprocessors

Authors:
Vijay S. Pai;Parthasarathy Ranganathan;Hazim Abdel-Shafi;Sarita Adve
Affiliations:
Rice Univ., Houston, TX;Rice Univ., Houston, TX;Rice Univ., Houston, TX;Rice Univ., Houston, TX
Venue:
IEEE Transactions on Computers - Special issue on cache memory and related problems
Year:
1999



Abstract

Current microprocessors incorporate techniques to aggressively exploit instruction-level parallelism (ILP). This paper evaluates the impact of such processors on the performance of shared-memory multiprocessors, both with and without the latency-hiding optimization of software prefetching. Our results show that, while ILP techniques substantially reduce CPU time in multiprocessors, they are less effective in removing memory stall time. Consequently, despite the inherent latency-tolerance features of ILP processors, we find memory system performance to be a larger bottleneck and parallel efficiencies to be generally poorer in ILP-based multiprocessors than in previous-generation multiprocessors. The main reasons for these deficiencies are insufficient opportunities in the applications to overlap multiple load misses and increased contention for resources in the system. We also find that software prefetching does not change the memory-bound nature of most of our applications on our ILP multiprocessor, mainly due to a large number of late prefetches and resource contention. Our results suggest the need for additional latency-hiding or latency-reducing techniques for ILP systems, such as software clustering of load misses and producer-initiated communication.
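
For readers unfamiliar with software prefetching, the following is a minimal sketch of the general idea and not the paper's own code or benchmark. It assumes a GCC- or Clang-compatible compiler, which provides the `__builtin_prefetch` builtin. The function name `sum_with_prefetch` and the constant `PREFETCH_DISTANCE` are hypothetical and chosen only for illustration. If the distance is too small relative to the memory latency, the data has not arrived by the time the load executes, which is the kind of "late prefetch" the abstract mentions.

```c
#include <stddef.h>

/* Hypothetical tuning knob: how many elements ahead of the current
 * index a prefetch is issued. Too small a value makes the prefetch
 * "late" (the demand load still stalls); the right value depends on
 * the machine and is an assumption here. */
#define PREFETCH_DISTANCE 16

/* Sum an array while issuing a read prefetch for an element that
 * will be needed PREFETCH_DISTANCE iterations later. */
long sum_with_prefetch(const long *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* rw = 0 requests a read prefetch; locality = 3 hints
             * that the data should be kept in all cache levels. */
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 3);
        }
        total += data[i];
    }
    return total;
}
```

The prefetch is only a hint: the result of the function is the same whether or not the hardware honors it, so any benefit shows up in execution time rather than in the output.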