The three major approaches to increasing the nominal performance of a CPU are multiplying the number of cores per socket, expanding the embedded cache memories, and using multithreading to reduce the impact of the deep memory hierarchy. Systems with tens or hundreds of hardware threads, all sharing a cache-coherent UMA or NUMA memory space, are today the de facto standard. While these approaches readily provide benefits in a multi-program environment, applications must be recoded to leverage the available parallelism. Threads must synchronize and exchange data, and overall performance is heavily influenced by the overhead these mechanisms add, especially as developers try to exploit finer-grained parallelism in order to use all available resources.