To achieve high performance in emerging multicores, it is crucial to reduce the number of memory accesses that suffer from very high latencies. However, this must be done with care, because improving the latency of one access can worsen the latency of another as a result of resource sharing. The goal should therefore be to balance the latencies of the memory accesses issued by an application in an execution phase while keeping the average latency low. Targeting Network-on-Chip (NoC) based multicores, we propose two network prioritization schemes that cooperatively improve performance by reducing end-to-end memory access latencies. Our first scheme prioritizes memory response messages: in a given period of time, the messages of an application that experience higher latencies than that application's average message latency are expedited, yielding a more uniform memory latency pattern. Our second scheme prioritizes request messages destined for idle memory banks over other requests, with the goal of improving bank utilization and preventing long queues from building up in front of the memory banks. Together, these two network prioritization-based optimizations lead to uniform memory access latencies with a low average value. Our experiments with a 4x8 mesh network-based multicore show that, when applied together, our schemes achieve 15%, 10%, and 13% performance improvement on memory-intensive, memory non-intensive, and mixed multiprogrammed workloads, respectively.
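The two prioritization rules described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not say how latencies are tracked or how routers learn that a bank is idle, so `LatencyTracker`, `response_is_high_priority`, `request_is_high_priority`, `arbitrate`, and the message fields `high_priority` and `inject_cycle` are all hypothetical names chosen for illustration. The sketch assumes a per-application running average of completed access latencies, a simple busy flag per bank, and an oldest-first tie-break.

```python
class LatencyTracker:
    """Hypothetical per-application running average of access latencies."""

    def __init__(self):
        self.total = {}  # app_id -> sum of recorded latencies (cycles)
        self.count = {}  # app_id -> number of recorded accesses

    def record(self, app_id, latency):
        # Add one completed access to the application's history.
        self.total[app_id] = self.total.get(app_id, 0) + latency
        self.count[app_id] = self.count.get(app_id, 0) + 1

    def average(self, app_id):
        # Average latency so far, or None if the app has no history yet.
        n = self.count.get(app_id, 0)
        return self.total[app_id] / n if n else None


def response_is_high_priority(tracker, app_id, latency_so_far):
    """Scheme 1 (assumed form): expedite a response whose accumulated
    latency already exceeds its application's average latency."""
    avg = tracker.average(app_id)
    return avg is not None and latency_so_far > avg


def request_is_high_priority(bank_busy, dest_bank):
    """Scheme 2 (assumed form): expedite a request whose destination
    memory bank is currently idle."""
    return not bank_busy[dest_bank]


def arbitrate(candidates):
    """Pick one message: high-priority first, then the oldest injection
    cycle (the tie-break is an assumption, not taken from the source)."""
    return min(candidates, key=lambda m: (not m["high_priority"], m["inject_cycle"]))


tracker = LatencyTracker()
tracker.record(0, 100)
tracker.record(0, 200)

late = response_is_high_priority(tracker, 0, 180)   # above the app's average
early = response_is_high_priority(tracker, 0, 120)  # below the app's average
to_idle_bank = request_is_high_priority([True, False], 1)

winner = arbitrate([
    {"id": "a", "high_priority": False, "inject_cycle": 5},
    {"id": "b", "high_priority": True, "inject_cycle": 9},
])
```

In this sketch, a router arbiter would call the two predicates to tag each message and then use `arbitrate` to choose among the messages competing for an output port. A real design would also need a starvation guard for low-priority traffic, which is omitted here.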