ACM Transactions on Embedded Computing Systems (TECS) - Special Issue on Design Challenges for Many-Core Processors, Special Section on ESTIMedia'13 and Regular Papers
This paper presents an architecture for high-performance computing systems specifically targeted at irregular applications. We show how a multi-core paradigm can benefit from next-generation memories and networks while still relying on fine-grained multithreading for latency tolerance. We also show that such an architecture template must employ specific techniques to optimize bandwidth utilization and achieve better scalability, and we propose a mechanism based on the aggregation of remote memory references. We explore the proposed architecture template using a custom simulation infrastructure and validate its performance with three typical irregular applications. Our experimental results show the benefits provided by the multi-core approach, in terms of improved scalability, and by the reference aggregation technique, in terms of contention reduction and bandwidth optimization. For a configuration with 32 nodes, each with 8 cores and 2 memory controllers, the proposed bandwidth optimization technique with its best parameters achieves 1.20 to 2.15 times higher performance and reduces network traffic by up to 34.7% on the considered applications.
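The abstract does not spell out how the aggregation mechanism works, so the following is only an illustrative sketch of the general idea: references bound for the same remote node are buffered and sent together, so the fixed per-packet overhead is paid once per batch rather than once per reference. The class name, the `simulate` helper, the header and reference sizes, and the flush-when-full policy are all assumptions made for illustration, not details taken from the paper.

```python
from collections import defaultdict

HEADER_BYTES = 16  # assumed per-packet header cost (hypothetical value)
REF_BYTES = 8      # assumed size of one memory reference (hypothetical value)


class ReferenceAggregator:
    """Buffers remote memory references per destination node and sends
    each buffer as a single packet once it fills up (illustrative only)."""

    def __init__(self, max_refs_per_packet):
        self.max_refs = max_refs_per_packet
        self.buffers = defaultdict(list)  # destination node -> pending addresses
        self.packets_sent = 0
        self.bytes_sent = 0

    def issue(self, dest_node, address):
        """Queue one remote reference; flush the buffer when it is full."""
        buf = self.buffers[dest_node]
        buf.append(address)
        if len(buf) >= self.max_refs:
            self._flush(dest_node)

    def _flush(self, dest_node):
        """Send all pending references for one destination as one packet."""
        buf = self.buffers[dest_node]
        if not buf:
            return
        self.packets_sent += 1
        self.bytes_sent += HEADER_BYTES + REF_BYTES * len(buf)
        buf.clear()

    def flush_all(self):
        """Send every partially filled buffer."""
        for dest_node in list(self.buffers):
            self._flush(dest_node)


def simulate(refs, max_refs_per_packet):
    """Return (packets, bytes) for a sequence of (dest_node, address) pairs."""
    agg = ReferenceAggregator(max_refs_per_packet)
    for dest_node, address in refs:
        agg.issue(dest_node, address)
    agg.flush_all()
    return agg.packets_sent, agg.bytes_sent
```

Setting `max_refs_per_packet` to 1 models unaggregated traffic, where every reference pays the full header cost. A practical design would presumably also flush on a timeout so that a partially filled buffer does not delay references indefinitely; that trade-off is omitted here.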