The Stanford Dash Multiprocessor
Computer
Performance modelling of a multiprocessor bus architecture
ANSS '91 Proceedings of the 24th annual symposium on Simulation
Proceedings of the 38th annual Design Automation Conference
Data cache locking for higher program predictability
SIGMETRICS '03 Proceedings of the 2003 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Cache memories: A tutorial and survey of current research directions
ACM '82 Proceedings of the ACM '82 conference
Power/Performance Advantages of Victim Buffer in High-Performance Processors
VOLTA '99 Proceedings of the IEEE Alessandro Volta Memorial Workshop on Low-Power Design
Using a Victim Buffer in an Application-Specific Memory Hierarchy
Proceedings of the conference on Design, automation and test in Europe - Volume 1
Design for Timing Predictability
Real-Time Systems
Cache coherence support for non-shared bus architecture on heterogeneous MPSoCs
Proceedings of the 42nd annual Design Automation Conference
A survey of research and practices of Network-on-chip
ACM Computing Surveys (CSUR)
Cache coherence tradeoffs in shared-memory MPSoCs
ACM Transactions on Embedded Computing Systems (TECS)
Performance evaluation of exclusive cache hierarchies
ISPASS '04 Proceedings of the 2004 IEEE International Symposium on Performance Analysis of Systems and Software
Detecting Conflicts of Interest
RE '06 Proceedings of the 14th IEEE International Requirements Engineering Conference
Proximity-aware directory-based coherence for multi-core processor architectures
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
The worst-case execution-time problem—overview of methods and survey of tools
ACM Transactions on Embedded Computing Systems (TECS)
Performance and energy trade-offs analysis of L2 on-chip cache architectures for embedded MPSoCs
Proceedings of the 20th symposium on Great lakes symposium on VLSI
Journal of Parallel and Distributed Computing
Hi-index | 0.00 |
In order to satisfy the needs for increasing computer processing power, there are significant changes in the design process of modern computing systems. Major chip-vendors are deploying multicore or manycore processors to their product lines. Multicore architectures offer a tremendous amount of processing speed. At the same time, they bring challenges for embedded systems which suffer from limited resources. Various cache memory hierarchies have been proposed to satisfy the requirements for different embedded systems. Normally, a level-1 cache (CL1) memory is dedicated to each core. However, the level-2 cache (CL2) can be shared (like Intel Xeon and IBM Cell) or distributed (like AMD Athlon). In this paper, we investigate the impact of the CL2 organization type (shared Vs distributed) on the performance and power consumption of homogeneous multicore embedded systems. We use VisualSim and Heptane tools to model and simulate the target architectures running FFT, MI, and DFT applications. Experimental results show that by replacing a single-core system with an 8-core system, reductions in mean delay per core of 64% for distributed CL2 and 53% for shared CL2 are possible with little additional power (15% for distributed CL2 and 18% for shared CL2) for FFT. Results also reveal that the distributed CL2 hierarchy outperforms the shared CL2 hierarchy for all three applications considered and for other applications with similar code characteristics.