ACM Computing Surveys (CSUR)
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Eraser: a dynamic data race detector for multi-threaded programs
Proceedings of the sixteenth ACM symposium on Operating systems principles
RecPlay: a fully integrated practical record/replay system
ACM Transactions on Computer Systems (TOCS)
The Augmint multiprocessor simulation toolkit for Intel x86 architectures
ICCD '96 Proceedings of the 1996 International Conference on Computer Design, VLSI in Computers and Processors
A "flight data recorder" for enabling full-system multiprocessor deterministic replay
Proceedings of the 30th annual international symposium on Computer architecture
A NUCA substrate for flexible CMP cache sharing
Proceedings of the 19th annual international conference on Supercomputing
Valgrind: a framework for heavyweight dynamic binary instrumentation
Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Thousand core chips: a technology perspective
Proceedings of the 44th annual Design Automation Conference
HARD: Hardware-Assisted Lockset-based Race Detection
HPCA '07 Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
Rerun: Exploiting Episodes for Lightweight Memory Race Recording
ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
In-band cross-trigger event transmission for transaction-based debug
Proceedings of the conference on Design, automation and test in Europe
RunAssert: a non-intrusive run-time assertion for parallel programs debugging
Proceedings of the Conference on Design, Automation and Test in Europe
Hi-index | 0.00 |
Traditional debug methodologies are limited in their ability to provide debugging support for many-core parallel programming. Synchronization problems or bugs due to race conditions are particularly difficult to detect with software debugging tools. Most traditional debugging approaches rely on globally synchronized signals, but these pose problems in terms of scalability. The first contribution of this paper is to propose a novel nonuniform debug architecture (NUDA) based on a ring interconnection schema. Our approach makes debugging both feasible and scalable for many-core processing scenarios. The key idea is to distribute the debugging support structures across a set of hierarchical clusters while avoiding address overlap. This allows the address space to be monitored using non-uniform protocols. Our second contribution is a non-intrusive approach to race detection supported by the NUDA. A non-uniform page-based monitoring cache in each NUDA node is used to watch the access footprints. The union of all the caches can serve as a race detection probe. Using the proposed approach, we show that parallel race bugs can be precisely captured, and that most false-positive alerts can be efficiently eliminated at an average slow-down cost of only 1.4%~3.6%. The net hardware cost is relatively low, so that the NUDA can readily scale increasingly complex many-core systems.