The Blue Gene®/L compute chip is a dual-processor system-on-a-chip capable of delivering an arithmetic peak performance of 5.6 gigaflops. To match memory speed to this high compute performance, the system implements an aggressive three-level on-chip cache hierarchy. The hierarchy offers high bandwidth and integrated prefetching at levels 2 and 3 (L2 and L3) to reduce effective memory access time. A Gigabit Ethernet interface driven by direct memory access (DMA) is integrated into the cache hierarchy, requiring only an external physical link-layer chip to connect to the media. The integrated L3 cache stores a total of 4 MB of data, using multibank embedded dynamic random access memory (eDRAM). The 1,024-bit-wide data port of the embedded DRAM provides 22.4 GB/s of bandwidth to serve the speculative prefetching demands of the two processor cores and the Gigabit Ethernet DMA engine. To reduce the hardware overhead of cache coherence intervention requests, memory coherence is maintained by software; this is particularly efficient for regular, highly parallel applications with partitionable working sets. The system further integrates an on-chip double-data-rate (DDR) DRAM controller for direct attachment of main memory modules, optimizing overall memory performance and cost. For booting the system and for low-latency interprocessor communication and synchronization, a 16-KB static random access memory (SRAM) and hardware locks have been added to the design.
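The headline figures above can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes one full-width transfer per eDRAM interface cycle and a 700-MHz core clock with a dual floating-point unit issuing two fused multiply-adds (4 flops) per cycle per core; the clock frequency and per-cycle flop count are assumptions not stated in the abstract, though they are consistent with the quoted 5.6-gigaflop peak.

```python
# Back-of-envelope checks of the quoted Blue Gene/L figures.

# L3 eDRAM port: 1,024 bits wide, 22.4 GB/s quoted bandwidth.
port_width_bytes = 1024 // 8          # 128 bytes per transfer
quoted_bandwidth = 22.4e9             # bytes/s, as stated in the text
# Implied transfer rate if each cycle moves one full-width word
# (an assumption; the abstract does not give the interface timing).
implied_rate_mhz = quoted_bandwidth / port_width_bytes / 1e6
print(f"implied eDRAM transfer rate: {implied_rate_mhz:.0f} MHz")

# Peak arithmetic rate: 2 cores x assumed 700 MHz x 4 flops/cycle.
cores = 2
clock_hz = 700e6                      # assumed core clock
flops_per_cycle = 4                   # assumed: two FMAs per core per cycle
peak_gflops = cores * clock_hz * flops_per_cycle / 1e9
print(f"peak arithmetic rate: {peak_gflops:.1f} gigaflops")
```

Under these assumptions the eDRAM port runs at 175 MHz and the peak rate works out to the 5.6 gigaflops quoted in the abstract.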