Scalable parallel computers are often nonuniform communication architectures (NUCAs), where the access time to other processors' caches varies with their physical location. Still, few attempts have been made to explore cache-to-cache communication locality. This paper introduces a new kind of synchronization primitive (lock-unlock) that favors neighboring processors when a lock is released. This improves the lock handover time as well as the access time to the shared data of the critical region.

A critical section guarded by our new RH lock executes in less than half the time of the same critical section guarded by any other lock on our NUCA hardware. With 28 processors, Raytrace ran 2.23 to 4.68 times faster, while global traffic decreased dramatically compared with all the other locks. Averaged over the seven applications studied, execution time improved by 7 to 24% and global traffic decreased by 8 to 28%.
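The idea of favoring neighboring processors on release can be illustrated with a small handover policy. The sketch below is not the RH lock algorithm, whose details are not given here. It is a simplified, hypothetical model in which waiters are tagged with a node id and the releaser prefers a waiter on its own node. The function name `pick_next_holder` and the `max_local_streak` fairness bound are assumptions introduced only for illustration.

```python
from collections import deque


def pick_next_holder(waiters, releasing_node, local_streak, max_local_streak=4):
    """Choose which waiter receives the lock next (illustrative model only).

    waiters: deque of (cpu_id, node_id) tuples in arrival order.
    releasing_node: node id of the processor releasing the lock.
    local_streak: number of consecutive same-node handovers so far.
    Returns (chosen_waiter_or_None, new_local_streak).
    """
    if not waiters:
        return None, 0

    # Prefer the oldest waiter on the releasing node, unless the
    # (hypothetical) fairness bound on local handovers has been reached.
    if local_streak < max_local_streak:
        for waiter in waiters:
            if waiter[1] == releasing_node:
                waiters.remove(waiter)
                return waiter, local_streak + 1

    # No local waiter, or the streak limit was hit: fall back to FIFO order.
    waiter = waiters.popleft()
    if waiter[1] == releasing_node:
        return waiter, local_streak  # still local; the streak is not reset
    return waiter, 0  # the lock migrates to another node


# Usage: node 0 releases the lock while CPUs on nodes 0 and 1 are waiting.
queue = deque([(0, 1), (1, 0), (2, 1), (3, 0)])
first, streak = pick_next_holder(queue, releasing_node=0, local_streak=0)
second, streak = pick_next_holder(queue, releasing_node=0, local_streak=streak)
third, streak = pick_next_holder(queue, releasing_node=0, local_streak=streak)
```

In this model the two waiters on node 0 are served before the lock migrates to node 1, so the lock and the data it protects stay in nearby caches for longer. The streak bound is one possible way to keep remote waiters from starving.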