A Bandwidth-Optimized Multi-core Architecture for Irregular Applications

Authors:
Simone Secchi; Antonino Tumeo; Oreste Villa
Venue:
CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
Year:
2012

Abstract

This paper presents an architecture for high-performance computing systems specifically targeted at irregular applications. We show how a multi-core paradigm can benefit from next-generation memories and networks while still relying on fine-grained multi-threading for latency tolerance. We also show that such an architecture template must employ specific techniques to optimize bandwidth utilization and achieve better scalability, and we propose a mechanism based on the aggregation of remote memory references. We explore the proposed architecture template using a custom simulation infrastructure and validate its performance with three typical irregular applications. Our experimental results show the benefits provided by the multi-core approach, in terms of improved scalability, and by the reference aggregation technique, in terms of contention reduction and bandwidth optimization. For a configuration with 32 nodes, 8 cores and 2 memory controllers per node, the proposed bandwidth optimization technique with the best parameters achieves 1.20 to 2.15 times higher performance and reduces network traffic by up to 34.7% with the considered applications.