Scaling the bandwidth wall: challenges in and avenues for CMP scaling
Proceedings of the 36th annual international symposium on Computer architecture
Allocation wall: a limiting factor of Java applications on emerging multi-core platforms
Proceedings of the 24th ACM SIGPLAN conference on Object oriented programming systems languages and applications
Tale in the multi-core era: is java still competitive to host SIP applications?
ICC'09 Proceedings of the 2009 IEEE international conference on Communications
An analysis of Linux scalability to many cores
OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Experiences in building and scaling an enterprise application on multicore systems
Concurrency and Computation: Practice & Experience
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
ACM SIGOPS 24th Symposium on Operating Systems Principles
Everything you always wanted to know about synchronization but were afraid to ask
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
Hi-index | 0.00 |
The multi-threaded nature of many commercial applications makes them seemingly a good fit with the increasing number of available multi-core architectures. This paper presents our performance studies of a collection of commercial workloads on a multi-core system that is designed for total throughput. The selected workloads include full operational applications such as SAP-SD and IBM Trade, and popular synthetic benchmarks such as SPECjbb2005, SPEC SDET, Dbench, and Tbench. To evaluate the performance scalability and the thread-placement sensitivity, we monitor the application throughput, processor performance, and the memory subsystem of 8, 16, 24, and 32 hardware threads with (a) increasing number of cores and (b) increasing number of threads per core. We observe that these workloads scale close to linearly (with efficiencies ranging from 86% to 99%) with increasing number of cores. For scaling with hardware-threads per core, the efficiencies are between 50% and 70%. Furthermore, among other observations, our data show that the ability of hiding long latency memory operations (i.e. L2 misses) in a multi-core system enables the performance scaling.