Improving data cache performance by pre-executing instructions under a cache miss
ICS '97 Proceedings of the 11th international conference on Supercomputing
A performance comparison of contemporary DRAM architectures
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Analysis and Optimization of Disk Storage Devices for Time-Sharing Systems
Journal of the ACM (JACM)
Implementation of precise interrupts in pipelined processors
ISCA '85 Proceedings of the 12th annual international symposium on Computer architecture
Proceedings of the 27th annual international symposium on Computer architecture
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
A comparative analysis of disk scheduling policies
Communications of the ACM
Dynamic Access Ordering for Streamed Computations
IEEE Transactions on Computers
Symbiotic jobscheduling for a simultaneous multithreaded processor
ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Lockup-free instruction fetch/prefetch cache organization
ISCA '81 Proceedings of the 8th annual symposium on Computer Architecture
Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning
HPCA '02 Proceedings of the 8th International Symposium on High-Performance Computer Architecture
Modern dram architectures
Microarchitecture Optimizations for Exploiting Memory-Level Parallelism
Proceedings of the 31st annual international symposium on Computer architecture
Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture
Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Pinpointing Representative Portions of Large Intel® Itanium® Programs with Dynamic Instrumentation
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Adaptive History-Based Memory Schedulers
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Memory Controller Optimizations for Web Servers
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
A Performance Comparison of DRAM Memory System Optimizations for SMT Processors
HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
A study of performance impact of memory controller features in multi-processor server environment
WMPI '04 Proceedings of the 3rd workshop on Memory performance issues: in conjunction with the 31st international symposium on computer architecture
Pin: building customized program analysis tools with dynamic instrumentation
Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
A Case for MLP-Aware Cache Replacement
Proceedings of the 33rd annual international symposium on Computer Architecture
Fairness and Throughput in Switch on Event Multithreading
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Framework for instruction-level tracing and analysis of program executions
Proceedings of the 2nd international conference on Virtual execution environments
QoS policies and architecture for cache/memory in CMP platforms
Proceedings of the 2007 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Effective Management of DRAM Bandwidth in Multicore Processors
PACT '07 Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques
A Burst Scheduling Access Reordering Mechanism
HPCA '07 Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Memory performance attacks: denial of memory service in multi-core systems
SS'07 Proceedings of 16th USENIX Security Symposium on USENIX Security Symposium
Distributed order scheduling and its application to multi-core dram controllers
Proceedings of the twenty-seventh ACM symposium on Principles of distributed computing
Prefetch-Aware DRAM Controllers
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
A light-weight fairness mechanism for chip multiprocessor memory systems
Proceedings of the 6th ACM conference on Computing frontiers
Decoupled DIMM: building high-bandwidth memory system using low-speed DRAM devices
Proceedings of the 36th annual international symposium on Computer architecture
Achieving predictable performance through better memory controller placement in many-core CMPs
Proceedings of the 36th annual international symposium on Computer architecture
Complexity effective memory access scheduling for many-core accelerator architectures
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Application-aware prioritization mechanisms for on-chip networks
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Coordinated control of multiple prefetchers in multi-core systems
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Improving memory bank-level parallelism in the presence of prefetching
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Micro-pages: increasing DRAM efficiency with locality-aware data placement
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
An analytical model to exploit memory task scheduling
Proceedings of the 2010 Workshop on Interaction between Compilers and Computer Architecture
PIRATE: QoS and performance management in CMP architectures
ACM SIGMETRICS Performance Evaluation Review
Aérgia: exploiting packet latency slack in on-chip networks
Proceedings of the 37th annual international symposium on Computer architecture
Rethinking DRAM design and organization for energy-constrained multi-cores
Proceedings of the 37th annual international symposium on Computer architecture
Handling the problems and opportunities posed by multiple on-chip memory controllers
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Contention-Aware Scheduling on Multicore Systems
ACM Transactions on Computer Systems (TOCS)
Software-hardware cooperative DRAM bank partitioning for chip multiprocessors
NPC'10 Proceedings of the 2010 IFIP international conference on Network and parallel computing
EEEP: an extreme end to end flow control protocol for SDRAM access through networks on chip
Proceedings of the Fifth International Workshop on Interconnection Network Architecture: On-Chip, Multi-Chip
Memory Latency Reduction via Thread Throttling
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
CoQoS: Coordinating QoS-aware shared resources in NoC-based SoCs
Journal of Parallel and Distributed Computing
Memory systems in the many-core era: challenges, opportunities, and solution directions
Proceedings of the international symposium on Memory management
METE: meeting end-to-end QoS in multicores through system-wide resource management
Proceedings of the ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Studying the impact of hardware prefetching and bandwidth partitioning in chip-multiprocessors
Proceedings of the ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Predictive coordination of multiple on-chip resources for chip multiprocessors
Proceedings of the international conference on Supercomputing
Poster: revisiting virtual channel memory for performance and fairness on multi-core architecture
Proceedings of the international conference on Supercomputing
Memory power management via dynamic voltage/frequency scaling
Proceedings of the 8th ACM international conference on Autonomic computing
Prefetch-aware shared resource management for multi-core systems
Proceedings of the 38th annual international symposium on Computer architecture
Adaptive granularity memory systems: a tradeoff between storage efficiency and throughput
Proceedings of the 38th annual international symposium on Computer architecture
DBAR: an efficient routing algorithm to support multiple concurrent applications in networks-on-chip
Proceedings of the 38th annual international symposium on Computer architecture
METE: meeting end-to-end QoS in multicores through system-wide resource management
ACM SIGMETRICS Performance Evaluation Review - Performance evaluation review
Studying the impact of hardware prefetching and bandwidth partitioning in chip-multiprocessors
ACM SIGMETRICS Performance Evaluation Review - Performance evaluation review
Bandwidth constrained coordinated HW/SW prefetching for multicores
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part I
Optimal memory controller placement for chip multiprocessor
CODES+ISSS '11 Proceedings of the seventh IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
Utilizing RF-I and intelligent scheduling for better throughput/watt in a mobile GPU memory system
ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Improving System Energy Efficiency with Memory Rank Subsetting
ACM Transactions on Architecture and Code Optimization (TACO)
Hierarchical memory scheduling for multimedia MPSoCs
Proceedings of the International Conference on Computer-Aided Design
A fair thread-aware memory scheduling algorithm for chip multiprocessor
ICA3PP'10 Proceedings of the 10th international conference on Algorithms and Architectures for Parallel Processing - Volume Part I
Minimalist open-page: a DRAM page-mode scheduling policy for the many-core era
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
A register-file approach for row buffer caches in die-stacked DRAMs
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Parallel application memory scheduling
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Reducing memory interference in multicore systems via application-aware memory channel partitioning
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
ACM Transactions on Computer Systems (TOCS)
BWS: balanced work stealing for time-sharing multicores
Proceedings of the 7th ACM european conference on Computer Systems
SRP: symbiotic resource partitioning of the memory hierarchy in CMPs
HiPEAC'10 Proceedings of the 5th international conference on High Performance Embedded Architectures and Compilers
DIEF: an accurate interference feedback mechanism for chip multiprocessor memory systems
HiPEAC'10 Proceedings of the 5th international conference on High Performance Embedded Architectures and Compilers
Topology-Aware quality-of-service support in highly integrated chip multiprocessors
ISCA'10 Proceedings of the 2010 international conference on Computer Architecture
Proceedings of the 49th Annual Design Automation Conference
Rank idle time prediction driven last-level cache writeback
Proceedings of the 2012 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness
Trace-driven simulation of memory system scheduling in multithread application
Proceedings of the 2012 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness
Providing fairness on shared-memory multiprocessors via process scheduling
Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE joint international conference on Measurement and Modeling of Computer Systems
Multiple sub-row buffers in DRAM: unlocking performance and energy improvement opportunities
Proceedings of the 26th ACM international conference on Supercomputing
Unified memory optimizing architecture: memory subsystem control with a unified predictor
Proceedings of the 26th ACM international conference on Supercomputing
PARDIS: a programmable memory controller for the DDRx interfacing standards
Proceedings of the 39th Annual International Symposium on Computer Architecture
Improving writeback efficiency with decoupled last-write prediction
Proceedings of the 39th Annual International Symposium on Computer Architecture
A case for exploiting subarray-level parallelism (SALP) in DRAM
Proceedings of the 39th Annual International Symposium on Computer Architecture
Staged memory scheduling: achieving high performance and scalability in heterogeneous systems
Proceedings of the 39th Annual International Symposium on Computer Architecture
Dynamic QoS management for chip multiprocessors
ACM Transactions on Architecture and Code Optimization (TACO)
Optimizing datacenter power with memory system levers for guaranteed quality-of-service
Proceedings of the 21st international conference on Parallel architectures and compilation techniques
HaLock: hardware-assisted lock contention detection in multithreaded applications
Proceedings of the 21st international conference on Parallel architectures and compilation techniques
A software memory partition approach for eliminating bank-level interference in multicore systems
Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Survey of scheduling techniques for addressing shared resources in multicore processors
ACM Computing Surveys (CSUR)
Toward on-chip datacenters: a perspective on general trends and on-chip particulars
The Journal of Supercomputing
MAGE: adaptive granularity and ECC for resilient and power efficient memory systems
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance
Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Addressing End-to-End Memory Access Latency in NoC-Based Multicores
MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Conservative row activation to improve memory power efficiency
Proceedings of the 27th international ACM conference on International conference on supercomputing
Holistic run-time parallelism management for time and energy efficiency
Proceedings of the 27th international ACM conference on International conference on supercomputing
CMP off-chip bandwidth scheduling guided by instruction criticality
Proceedings of the 27th international ACM conference on International conference on supercomputing
Conservative open-page policy for mixed time-criticality memory controllers
Proceedings of the Conference on Design, Automation and Test in Europe
Improving memory scheduling via processor-side load criticality information
Proceedings of the 40th Annual International Symposium on Computer Architecture
Orchestrated scheduling and prefetching for GPGPUs
Proceedings of the 40th Annual International Symposium on Computer Architecture
Reducing memory access latency with asymmetric DRAM bank organizations
Proceedings of the 40th Annual International Symposium on Computer Architecture
A heterogeneous multiple network-on-chip design: an application-aware approach
Proceedings of the 50th Annual Design Automation Conference
Adaptive virtual channel partitioning for network-on-chip in heterogeneous architectures
ACM Transactions on Design Automation of Electronic Systems (TODAES) - Special Section on Networks on Chip: Architecture, Tools, and Methodologies
Distributed fair DRAM scheduling in network-on-chips architecture
Journal of Systems Architecture: the EUROMICRO Journal
Reshaping cache misses to improve row-buffer locality in multicore systems
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Meeting midway: improving CMP performance with memory-side prefetching
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Coordinate page allocation and thread group for improving main memory power efficiency
Proceedings of the Workshop on Power-Aware Computing and Systems
A programmable memory controller for the DDRx interfacing standards
ACM Transactions on Computer Systems (TOCS)
Design space exploration of on-chip ring interconnection for a CPU-GPU heterogeneous architecture
Journal of Parallel and Distributed Computing
REF: resource elasticity fairness with sharing incentives for multiprocessors
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
ACM Transactions on Architecture and Code Optimization (TACO)
Virtual machine consolidation based on interference modeling
The Journal of Supercomputing
Hi-index | 0.00 |
In a chip-multiprocessor (CMP) system, the DRAM system isshared among cores. In a shared DRAM system, requests from athread can not only delay requests from other threads by causingbank/bus/row-buffer conflicts but they can also destroy other threads’DRAM-bank-level parallelism. Requests whose latencies would otherwisehave been overlapped could effectively become serialized. As aresult both fairness and system throughput degrade, and some threadscan starve for long time periods.This paper proposes a fundamentally new approach to designinga shared DRAM controller that provides quality of service to threads,while also improving system throughput. Our parallelism-aware batchscheduler (PAR-BS) design is based on two key ideas. First, PARBSprocesses DRAM requests in batches to provide fairness and toavoid starvation of requests. Second, to optimize system throughput,PAR-BS employs a parallelism-aware DRAM scheduling policythat aims to process requests from a thread in parallel in the DRAMbanks, thereby reducing the memory-related stall-time experienced bythe thread. PAR-BS seamlessly incorporates support for system-levelthread priorities and can provide different service levels, includingpurely opportunistic service, to threads with different priorities.We evaluate the design trade-offs involved in PAR-BS and compareit to four previously proposed DRAM scheduler designs on 4-, 8-, and16-core systems. Our evaluations show that, averaged over 100 4-coreworkloads, PAR-BS improves fairness by 1.11X and system throughputby 8.3% compared to the best previous scheduling technique, Stall-Time Fair Memory (STFM) scheduling. Based on simple request prioritizationrules, PAR-BS is also simpler to implement than STFM.