Performance counters and state sharing annotations: a unified approach to thread locality

Authors:
Boris Weissman
Affiliations:
University of California at Berkeley and International Computer Science Institute, 1947 Center St, Suite 600, Berkeley, CA
Venue:
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Year:
1998



Abstract

This paper describes a combined approach for improving thread locality that uses the hardware performance monitors of modern processors and program-centric code annotations to guide thread scheduling on SMPs. The approach relies on a shared state cache model to compute expected thread footprints in the cache on-line. The accuracy of the model has been analyzed by simulations involving a set of parallel applications. We demonstrate how the cache model can be used to implement several practical locality-based thread scheduling policies with little overhead. Active Threads, a portable, high-performance thread system, has been built and used to investigate the performance impact of locality scheduling for several applications.
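To make the idea of on-line footprint estimation concrete, the following is a minimal sketch and not the paper's model. It assumes that each cache miss evicts a uniformly random line and that threads share no state, so it ignores the state sharing annotations entirely. The class `FootprintModel`, its methods, and the `misses` argument (standing in for a hardware miss-counter delta read at a context switch) are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FootprintModel:
    """Expected cache footprints, in lines, for a set of threads.

    Assumptions (not taken from the paper): every miss replaces a
    uniformly random line of a cache with `cache_lines` lines, and
    threads do not share any cached state.
    """
    cache_lines: int
    footprints: dict = field(default_factory=dict)

    def record_run(self, tid, misses):
        """Update all footprints after thread `tid` incurred `misses` misses."""
        # Probability that a given line survives `misses` random evictions.
        survive = (1.0 - 1.0 / self.cache_lines) ** misses
        self.footprints.setdefault(tid, 0.0)
        for other, size in list(self.footprints.items()):
            if other == tid:
                # Lines not owned by the running thread are claimed by it
                # unless they survive every eviction.
                foreign = self.cache_lines - size
                self.footprints[other] = self.cache_lines - foreign * survive
            else:
                # Footprints of descheduled threads decay.
                self.footprints[other] = size * survive

    def pick_next(self, runnable):
        """Pick the runnable thread with the largest expected footprint."""
        return max(runnable, key=lambda t: self.footprints.get(t, 0.0))


if __name__ == "__main__":
    model = FootprintModel(cache_lines=1024)
    model.record_run("a", misses=1024)
    model.record_run("b", misses=512)
    print(model.footprints)
    print(model.pick_next(["a", "b"]))
```

A scheduler built on such a model would call `record_run` at each context switch and `pick_next` when a processor becomes idle. A model that accounts for shared state, as the abstract describes, would also have to credit a thread for lines brought in by other threads that touch the same data.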