Impact of level-2 cache sharing on the performance and power requirements of homogeneous multicore embedded systems

Authors:
Abu Asaduzzaman;Fadi N. Sibai;Manira Rani
Affiliations:
Computer Sci. and Eng. Dept, Florida Atlantic University, Boca Raton, FL, USA;College of Information Tech., UAE University, P.O. Box 17555, Al Ain, United Arab Emirates;Computer Sci. and Eng. Dept, Florida Atlantic University, Boca Raton, FL, USA
Venue:
Microprocessors & Microsystems
Year:
2009

Abstract

To meet the growing demand for processing power, the design of modern computing systems is changing significantly. Major chip vendors are adding multicore and manycore processors to their product lines. Multicore architectures offer substantial processing speed, but they also pose challenges for embedded systems, which have limited resources. Various cache memory hierarchies have been proposed to meet the requirements of different embedded systems. Normally, a level-1 cache (CL1) is dedicated to each core. The level-2 cache (CL2), however, can be shared (as in Intel Xeon and IBM Cell) or distributed (as in AMD Athlon). In this paper, we investigate the impact of the CL2 organization (shared vs. distributed) on the performance and power consumption of homogeneous multicore embedded systems. We use the VisualSim and Heptane tools to model and simulate the target architectures running FFT, MI, and DFT applications. Experimental results show that, for FFT, replacing a single-core system with an 8-core system can reduce mean delay per core by 64% for distributed CL2 and 53% for shared CL2 with little additional power (15% for distributed CL2 and 18% for shared CL2). The results also indicate that the distributed CL2 hierarchy outperforms the shared CL2 hierarchy for all three applications considered and for other applications with similar code characteristics.
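
As a rough illustration of the FFT figures quoted in the abstract, the following sketch converts the reported percentages into values relative to the single-core baseline. The delay-times-power product is an assumption added here only as a convenient way to compare the two organizations; it is not a metric reported in the abstract, and the function and variable names are hypothetical.

```python
# Percentages quoted in the abstract for FFT when moving from 1 core to 8 cores.
RESULTS = {
    "distributed": {"delay_reduction_pct": 64, "power_increase_pct": 15},
    "shared": {"delay_reduction_pct": 53, "power_increase_pct": 18},
}


def relative_figures(delay_reduction_pct, power_increase_pct):
    """Return (delay, power, delay*power), each relative to the 1-core baseline.

    The product is an illustrative combined figure (an assumption of this
    sketch), not a quantity taken from the paper.
    """
    delay = 1 - delay_reduction_pct / 100
    power = 1 + power_increase_pct / 100
    return delay, power, delay * power


for name, figures in RESULTS.items():
    delay, power, product = relative_figures(**figures)
    print(f"{name}: delay x{delay:.2f}, power x{power:.2f}, "
          f"delay*power {product:.3f}")
```

Under these assumptions the distributed CL2 configuration gives a product of about 0.41 and the shared CL2 configuration about 0.55, which matches the abstract's statement that the distributed hierarchy does better for FFT.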