Landing stencil code on Godson-T

Authors:
Hui-Min Cui; Lei Wang; Dong-Rui Fan; Xiao-Bing Feng
Affiliations:
Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China and Graduate University of Chinese Academy of Sciences, Beijing, ...; Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China and Graduate University of Chinese Academy of Sciences, Beijing, ...; Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Venue:
Journal of Computer Science and Technology
Year:
2010

Abstract

The advent of multi-core/many-core chip technology offers both an extraordinary opportunity and a profound challenge. In particular, computer architects and system software designers have a unique opportunity to introduce new architectural features together with adequate compiler technology; in combination, the two may have a profound impact. This paper presents a case study, using the 1-D Jacobi computation, of compiler-amenable performance optimization techniques on the many-core architecture Godson-T. Two of Godson-T's distinctive features are the focus of this study: 1) chip-level globally addressable memory, in particular the scratchpad memories (SPM) local to the processing cores; and 2) fine-grain memory-based synchronization (e.g., full-empty bits). Leveraging state-of-the-art performance optimization methods for 1-D stencil parallelization (e.g., timed tiling and its variants), we developed and implemented a number of many-core-oriented optimizations for Godson-T. Our experimental study shows good performance in both execution-time speedup and scalability, validates the value of the globally accessible SPM and the fine-grain synchronization mechanism (full-empty bits) on Godson-T, and provides some useful guidelines for future compiler technology for many-core chip architectures.
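
For readers unfamiliar with the kernel used in the case study, the following is a minimal sequential sketch of a 1-D Jacobi computation. It is an illustration only and not the authors' Godson-T code: the function name `jacobi_1d`, the 3-point averaging weights, and the fixed boundary values are all assumptions, and it models neither the scratchpad memories nor full-empty-bit synchronization.

```python
def jacobi_1d(values, steps):
    """Run `steps` sweeps of a 3-point 1-D Jacobi stencil (illustrative sketch).

    Assumptions, not taken from the paper: equal 1/3 weights and fixed
    (never updated) boundary cells at both ends of the array.
    """
    cur = list(values)
    nxt = list(values)  # boundary cells are copied once and never rewritten
    for _ in range(steps):
        for i in range(1, len(cur) - 1):
            # Each interior point is recomputed only from values of the
            # previous time step, so all points in a sweep are independent.
            nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3.0
        cur, nxt = nxt, cur  # swap the two buffers for the next sweep
    return cur


if __name__ == "__main__":
    print(jacobi_1d([0.0, 0.0, 3.0, 0.0, 0.0], 1))
```

Because a point at time step t depends on three neighbouring points at step t-1, any scheme that tiles across time steps has to exchange or synchronize on tile-boundary values between neighbouring cores, which is where fine-grain synchronization becomes relevant.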