The SPARC architecture manual (version 9)
The SPARC architecture manual (version 9)
Contrasting characteristics and cache performance of technical and multi-user commercial workloads
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Hitting the memory wall: implications of the obvious
ACM SIGARCH Computer Architecture News
Value locality and load value prediction
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Improving data cache performance by pre-executing instructions under a cache miss
ICS '97 Proceedings of the 11th international conference on Supercomputing
The predictability of data values
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Highly accurate data value prediction using hybrid predictors
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Memory system characterization of commercial workloads
Proceedings of the 25th annual international symposium on Computer architecture
Analytic evaluation of shared-memory systems with ILP processors
Proceedings of the 25th annual international symposium on Computer architecture
Performance of database workloads on shared-memory systems with out-of-order processors
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Code transformations to improve memory parallelism
Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Slice-processors: an implementation of operation-based prediction
ICS '01 Proceedings of the 15th international conference on Supercomputing
Speculative precomputation: long-range prefetching of delinquent loads
ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Design and evaluation of compiler algorithms for pre-execution
Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
Enhancing memory level parallelism via recovery-free value prediction
ICS '03 Proceedings of the 17th annual international conference on Supercomputing
The Impact of Instruction-Level Parallelism on Multiprocessor Performance and Simulation Methodology
HPCA '97 Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture
Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Speculative Data-Driven Multithreading
HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Scaling and Charact rizing Database Workloads: Bridging the Gap between Research and Practice
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Checkpoint Processing and Recovery: Towards Scalable Large Instruction Window Processors
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Temporal Streaming of Shared Memory
Proceedings of the 32nd annual international symposium on Computer Architecture
Techniques for Efficient Processing in Runahead Execution Engines
Proceedings of the 32nd annual international symposium on Computer Architecture
High-Performance Throughput Computing
IEEE Micro
Efficient behavior-driven runtime dynamic voltage scaling policies
CODES+ISSS '05 Proceedings of the 3rd IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
Store-Ordered Streaming of Shared Memory
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
IEEE Transactions on Computers
Store Memory-Level Parallelism Optimizations for Commercial Applications
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
"Flea-flicker" Multipass Pipelining: An Alternative to the High-Power Out-of-Order Offense
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Chip multi-processor scalability for single-threaded applications
ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
Queue Usage and Memory-Level Parallelism Sensitive Scheduling
HPCASIA '05 Proceedings of the Eighth International Conference on High-Performance Computing in Asia-Pacific Region
A case study in top-down performance estimation for a large-scale parallel application
Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Chip multiprocessing and the cell broadband engine
Proceedings of the 3rd conference on Computing frontiers
A Case for MLP-Aware Cache Replacement
Proceedings of the 33rd annual international symposium on Computer Architecture
Proceedings of the 33rd annual international symposium on Computer Architecture
Overlapping dependent loads with addressless preload
Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Scalable Cache Miss Handling for High Memory-Level Parallelism
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
An analysis of the effects of miss clustering on the cost of a cache miss
Proceedings of the 4th international conference on Computing frontiers
Mechanisms for store-wait-free multiprocessors
Proceedings of the 34th annual international symposium on Computer architecture
Proceedings of the 2007 workshop on Experimental computer science
ecs'07 Experimental computer science on Experimental computer science
The cell broadband engine: exploiting multiple levels of parallelism in a chip multiprocessor
International Journal of Parallel Programming
Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems
ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Program optimization carving for GPU computing
Journal of Parallel and Distributed Computing
MLP-Aware Runahead Threads in a Simultaneous Multithreading Processor
HiPEAC '09 Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers
Per-thread cycle accounting in SMT processors
Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Memory-level parallelism aware fetch policies for simultaneous multithreading processors
ACM Transactions on Architecture and Code Optimization (TACO)
A mechanistic performance model for superscalar out-of-order processors
ACM Transactions on Computer Systems (TOCS)
Defining relevant distances between server workloads
Performance Evaluation
Improving memory bank-level parallelism in the presence of prefetching
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
FPGA prototyping of an amba-based windows-compatible SoC
Proceedings of the 18th annual ACM/SIGDA international symposium on Field programmable gate arrays
Studying compiler optimizations on superscalar processors through interval analysis
HiPEAC'08 Proceedings of the 3rd international conference on High performance embedded architectures and compilers
Where replacement algorithms fail: a thorough analysis
Proceedings of the 7th ACM international conference on Computing frontiers
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Federation: Boosting per-thread performance of throughput-oriented manycore architectures
ACM Transactions on Architecture and Code Optimization (TACO)
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Fine-grained DVFS using on-chip regulators
ACM Transactions on Architecture and Code Optimization (TACO)
Cache index-aware memory allocation
Proceedings of the international symposium on Memory management
Managing SMT resource usage through speculative instruction window weighting
ACM Transactions on Architecture and Code Optimization (TACO)
Efficiently exploiting memory level parallelism on asymmetric coupled cores in the dark silicon era
ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
ACM Transactions on Computer Systems (TOCS)
MLP-Aware instruction queue resizing: the key to power-efficient performance
ARCS'10 Proceedings of the 23rd international conference on Architecture of Computing Systems
Scheduling heterogeneous multi-cores through Performance Impact Estimation (PIE)
Proceedings of the 39th Annual International Symposium on Computer Architecture
A case for exploiting subarray-level parallelism (SALP) in DRAM
Proceedings of the 39th Annual International Symposium on Computer Architecture
Predicting memcached throughput using simulation and modeling
Proceedings of the 2012 Symposium on Theory of Modeling and Simulation - DEVS Integrative M&S Symposium
APC: a performance metric of memory systems
ACM SIGMETRICS Performance Evaluation Review
Regional cache organization for NoC based many-core processors
Journal of Computer and System Sciences
Accurately modeling superscalar processor performance with reduced trace
Journal of Parallel and Distributed Computing
Exploring memory consistency for massively-threaded throughput-oriented processors
Proceedings of the 40th Annual International Symposium on Computer Architecture
Hi-index | 0.00 |
The performance of memory-bound commercial applicationssuch as databases is limited by increasing memory latencies. Inthis paper, we show that exploiting memory-level parallelism(MLP) is an effective approach for improving the performance ofthese applications and that microarchitecture has a profound impacton achievable MLP. Using the epoch model of MLP, we reasonhow traditional microarchitecture features such as out-of-orderissue and state-of-the-art microarchitecture techniques suchas runahead execution affect MLP. Simulation results show that amoderately aggressive out-of-order issue processor improvesMLP over an in-order issue processor by 12-30%, and that aggressivehandling of loads, branches and serializing instructionsis needed to attain the full benefits of large out-of-order instructionwindows. The results also show that a processor's issue windowand reorder buffer should be decoupled to exploit MLP more efficiently.In addition, we demonstrate that runahead execution ishighly effective in enhancing MLP, potentially improving the MLPof the database workload by 82% and its overall performance by60%. Finally, our limit study shows that there is considerableheadroom in improving MLP and overall performance by implementingeffective instruction prefetching, more accurate branchprediction and better value prediction in addition to runahead execution.