The Journal of Supercomputing
Moore's Law suggests that the number of processing cores on a single chip will increase exponentially. Future performance gains will therefore come mainly from thread-level parallelism exploited by multi/many-core processors (MCPs). It is thus necessary to determine how to build MCP hardware and how to program parallelism on such MCPs. In this work, we intend to identify the key architectural mechanisms and software optimizations that guarantee high performance for multithreaded programs. As a case study, we customize a dense matrix multiplication algorithm for the Godson-T MCP to demonstrate the efficient synergy and interaction between hardware and software. Experiments conducted on the cycle-accurate simulator show that the optimized matrix multiplication achieves 97.1% (124.3 GFLOPS) of the peak performance of Godson-T.
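The abstract does not describe the customized algorithm itself. As a rough illustration of the kind of software optimization usually applied to dense matrix multiplication on such processors, the following is a minimal sketch of generic loop tiling (blocking), which reuses small sub-blocks of the operands while they are held in fast on-chip memory. It is not the authors' Godson-T implementation; the function names `matmul_blocked` and `matmul_naive` and the `tile` parameter are assumptions made for this example.

```python
def matmul_naive(a, b, n):
    """Reference triple loop: C = A * B for n x n matrices (lists of lists)."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            for j in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, n, tile):
    """Tiled C = A * B. `tile` is a hypothetical block edge length that
    would be chosen so three tile x tile blocks fit in fast local memory."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):          # block row of C
        for kk in range(0, n, tile):      # block along the shared dimension
            for jj in range(0, n, tile):  # block column of C
                # Multiply one block of A by one block of B and
                # accumulate into the matching block of C.
                for i in range(ii, min(ii + tile, n)):
                    row_c = c[i]
                    for k in range(kk, min(kk + tile, n)):
                        a_ik = a[i][k]
                        row_b = b[k]
                        for j in range(jj, min(jj + tile, n)):
                            row_c[j] += a_ik * row_b[j]
    return c


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    print(matmul_blocked(a, b, 2, 1))
```

The blocked version performs the same multiply-add operations as the naive one, only in a different order, so both produce the same product. On a real many-core chip each block of C would typically be assigned to a different thread, which is where thread-level parallelism enters.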