Performance modelling and optimization of memory access on cellular computer architecture Cyclops64

Authors:
Yanwei Niu;Ziang Hu;Kenneth Barner;Guang R. Gao
Affiliations:
Department of ECE, University of Delaware, Newark, DE;Department of ECE, University of Delaware, Newark, DE;Department of ECE, University of Delaware, Newark, DE;Department of ECE, University of Delaware, Newark, DE
Venue:
NPC'05 Proceedings of the 2005 IFIP international conference on Network and Parallel Computing
Year:
2005

Abstract

This paper focuses on the Cyclops64 computer architecture and presents an analytical model and performance simulation results for the preloading and loop unrolling approaches to optimizing the performance of an SVD (Singular Value Decomposition) benchmark. A performance model that breaks down the total execution cycles is presented. Data preloading, using either “memcpy” or hand-optimized “inline” assembly code, and loop unrolling are implemented and compared in terms of the total number of memory access cycles. The key idea is to preload data from off-chip to on-chip memory and store the data back after the computation. These approaches reduce the total memory access cycles and thus improve the benchmark performance significantly.
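
The preload, compute, store-back pattern combined with loop unrolling can be illustrated with a minimal C sketch. It is not the paper's implementation: the kernel (a plane rotation applied to two columns, of the kind used in Jacobi-style SVD methods), the function name `rotate_columns`, the buffer names, and the `MAX_COL` capacity are all assumptions made for illustration, and ordinary static arrays stand in for Cyclops64 on-chip memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_COL 1024  /* hypothetical capacity of each on-chip buffer */

/* Stand-ins for on-chip memory. On real hardware these buffers would be
   placed in the on-chip region by a platform-specific mechanism (not shown). */
static double onchip_a[MAX_COL];
static double onchip_b[MAX_COL];

/* Apply the rotation a' = c*a + s*b, b' = c*b - s*a to two columns that
   live in "off-chip" memory, using preload, compute, store-back. */
void rotate_columns(double *offchip_a, double *offchip_b, size_t n,
                    double c, double s)
{
    assert(n <= MAX_COL);

    /* 1. Preload: copy both columns into the on-chip buffers. */
    memcpy(onchip_a, offchip_a, n * sizeof(double));
    memcpy(onchip_b, offchip_b, n * sizeof(double));

    /* 2. Compute on on-chip data, with the loop unrolled by a factor of 4. */
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        double a0 = onchip_a[i],     b0 = onchip_b[i];
        double a1 = onchip_a[i + 1], b1 = onchip_b[i + 1];
        double a2 = onchip_a[i + 2], b2 = onchip_b[i + 2];
        double a3 = onchip_a[i + 3], b3 = onchip_b[i + 3];

        onchip_a[i]     = c * a0 + s * b0;
        onchip_b[i]     = c * b0 - s * a0;
        onchip_a[i + 1] = c * a1 + s * b1;
        onchip_b[i + 1] = c * b1 - s * a1;
        onchip_a[i + 2] = c * a2 + s * b2;
        onchip_b[i + 2] = c * b2 - s * a2;
        onchip_a[i + 3] = c * a3 + s * b3;
        onchip_b[i + 3] = c * b3 - s * a3;
    }
    /* Remainder loop for lengths that are not a multiple of 4. */
    for (; i < n; i++) {
        double a = onchip_a[i], b = onchip_b[i];
        onchip_a[i] = c * a + s * b;
        onchip_b[i] = c * b - s * a;
    }

    /* 3. Store back: copy the results to off-chip memory. */
    memcpy(offchip_a, onchip_a, n * sizeof(double));
    memcpy(offchip_b, onchip_b, n * sizeof(double));
}
```

In this sketch each column is read from and written to off-chip memory once, in bulk, while all element-wise accesses inside the loop hit the on-chip buffers. Whether that trade pays off on a given machine depends on the relative latencies of the two memory levels, which is what the paper's performance model is meant to quantify.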