The cache performance and optimizations of blocked algorithms
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Data cache performance of supercomputer applications
Proceedings of the 1990 ACM/IEEE conference on Supercomputing
Tile size selection using cache organization and data layout
PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Memory bandwidth limitations of future microprocessors
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
CP-PACS: a massively parallel processor for large scale scientific calculations
ICS '97 Proceedings of the 11th international conference on Supercomputing
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Augmenting Loop Tiling with Data Alignment for Improved Cache Performance
IEEE Transactions on Computers - Special issue on cache memory and related problems
Performance of lattice QCD programs on CP-PACS
Parallel Computing - Special issue on high performance computing in lattice QCD
IEEE Micro
HICSS '98 Proceedings of the Thirty-First Annual Hawaii International Conference on System Sciences-Volume 7 - Volume 7
Hi-index | 0.00 |
Technological trends have brought the growing disparity between processor and memory speeds. This memory wall problem is becoming very serious especially in high performance computing. In this paper, we propose a new architecture SCIMA for solving this problem. In SCIMA, addressable memory is integrated into the processor chip besides ordinary cache. Since the on-chip memory is software controllable, it has more ability to make good use of data locality than data cache, which is controlled by hardware. The purpose of on-chip memory is to reduce the off-chip memory traffic by exploiting data reusability as much as possible within a chip. We have evaluated SCIMA by using QCD simulation, a practical application in quantum field theory. The performance evaluation reveals that SCIMA successfully reduces off-chip memory traffic and achieves higher performance than cache-only processor.