Microarchitecture Optimizations for Exploiting Memory-Level Parallelism

Authors:
Yuan Chou; Brian Fahs; Santosh Abraham
Affiliations:
Sun Microsystems, Sunnyvale, CA; Sun Microsystems, Sunnyvale, CA; Sun Microsystems, Sunnyvale, CA
Venue:
Proceedings of the 31st annual international symposium on Computer architecture
Year:
2004

Abstract

The performance of memory-bound commercial applications such as databases is limited by increasing memory latencies. In this paper, we show that exploiting memory-level parallelism (MLP) is an effective approach for improving the performance of these applications and that microarchitecture has a profound impact on achievable MLP. Using the epoch model of MLP, we reason how traditional microarchitecture features such as out-of-order issue and state-of-the-art microarchitecture techniques such as runahead execution affect MLP. Simulation results show that a moderately aggressive out-of-order issue processor improves MLP over an in-order issue processor by 12-30%, and that aggressive handling of loads, branches and serializing instructions is needed to attain the full benefits of large out-of-order instruction windows. The results also show that a processor's issue window and reorder buffer should be decoupled to exploit MLP more efficiently. In addition, we demonstrate that runahead execution is highly effective in enhancing MLP, potentially improving the MLP of the database workload by 82% and its overall performance by 60%. Finally, our limit study shows that there is considerable headroom in improving MLP and overall performance by implementing effective instruction prefetching, more accurate branch prediction and better value prediction in addition to runahead execution.