Impact of sharing-based thread placement on multithreaded architectures
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Thread scheduling for cache locality
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
The performance implications of locality information usage in shared-memory multiprocessors
Journal of Parallel and Distributed Computing - Special issue on multithreading for multiprocessors
Performance counters and state sharing annotations: a unified approach to thread locality
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
ACM Computing Surveys (CSUR)
Symbiotic jobscheduling for a simultaneous multithreaded processor
ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
SEDA: an architecture for well-conditioned, scalable internet services
SOSP '01 Proceedings of the eighteenth ACM symposium on Operating systems principles
Effects of Memory Performance on Parallel Job Scheduling
JSSPP '01 Revised Papers from the 7th International Workshop on Job Scheduling Strategies for Parallel Processing
Using Cohort-Scheduling to Enhance Server Performance
ATEC '02 Proceedings of the General Track of the annual conference on USENIX Annual Technical Conference
A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning
HPCA '02 Proceedings of the 8th International Symposium on High-Performance Computer Architecture
Architectural Support for Enhanced SMT Job Scheduling
Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Scheduling Algorithms for Effective Thread Pairing on Hybrid Multiprocessors
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
Online performance analysis by statistical sampling of microprocessor performance counters
Proceedings of the 19th annual international conference on Supercomputing
Chip multithreading systems need a new operating system scheduler
Proceedings of the 11th workshop on ACM SIGOPS European workshop
Performance of multithreaded chip multiprocessors and implications for operating system design
ATEC '05 Proceedings of the annual conference on USENIX Annual Technical Conference
Hyper-threading aware process scheduling heuristics
ATEC '05 Proceedings of the annual conference on USENIX Annual Technical Conference
Enhancements for hyper-threading technology in the operating system: seeking the optimal scheduling
WIESS'02 Proceedings of the 2nd conference on Industrial Experiences with Systems Software - Volume 2
Steps towards cache-resident transaction processing
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Compatible phase co-scheduling on a CMP of multi-threaded processors
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Streamware: programming general-purpose multicore processors using streams
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Provably good multicore cache performance for divide-and-conquer algorithms
Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms
Contention-aware scheduler: unlocking execution parallelism in multithreaded java programs
Proceedings of the 23rd ACM SIGPLAN conference on Object-oriented programming systems languages and applications
Sharing-aware OS scheduling algorithms for multi-socket multi-core servers
IFMT '08 Proceedings of the 1st international forum on Next-generation multicore/manycore technologies
RapidMRC: approximating L2 miss rate curves on commodity systems for online optimizations
Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Enhancing operating system support for multicore processors by using hardware performance monitoring
ACM SIGOPS Operating Systems Review
IWMSE '09 Proceedings of the 2009 ICSE Workshop on Multicore Software Engineering
The multikernel: a new OS architecture for scalable multicore systems
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Managing contention for shared resources on multicore processors
Communications of the ACM
A case for integrated processor-cache partitioning in chip multiprocessors
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs?
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Managing Contention for Shared Resources on Multicore Processors
Queue - Power Management
Probabilistic job symbiosis modeling for SMT processor scheduling
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Addressing shared resource contention in multicore processors via scheduling
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Bias scheduling in heterogeneous multi-core architectures
Proceedings of the 5th European conference on Computer systems
Dynamically managed multithreaded reconfigurable architectures for chip multiprocessors
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Tiled-MapReduce: optimizing resource usages of data-parallel applications on multicore with tiling
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Reinventing scheduling for multicore systems
HotOS'09 Proceedings of the 12th conference on Hot topics in operating systems
Corey: an operating system for many cores
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Contention-Aware Scheduling on Multicore Systems
ACM Transactions on Computer Systems (TOCS)
Proceedings of the 5th ACM/IEEE Symposium on Architectures for Networking and Communications Systems
Online cache modeling for commodity multicore processors
ACM SIGOPS Operating Systems Review
ULCC: a user-level facility for optimizing shared cache performance on multicores
Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Dynamic cache contention detection in multi-threaded applications
Proceedings of the 7th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments
Exploring implicit parallelism in class diagrams
Journal of Systems and Software
Array regrouping on CMP with non-uniform cache sharing
LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
A QHD-capable parallel H.264 decoder
Proceedings of the international conference on Supercomputing
The impact of memory subsystem resource sharing on datacenter applications
Proceedings of the 38th annual international symposium on Computer architecture
A case for NUMA-aware contention management on multicore systems
USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference
FACT: a framework for adaptive contention-aware thread migrations
Proceedings of the 8th ACM International Conference on Computing Frontiers
Overseer: low-level hardware monitoring and management for Java
Proceedings of the 9th International Conference on Principles and Practice of Programming in Java
Optimal task assignment in multithreaded processors: a statistical approach
ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
Region scheduling: efficiently using the cache architectures via page-level affinity
ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
ACM Transactions on Computer Systems (TOCS)
Combining locality analysis with online proactive job co-scheduling in chip multiprocessors
HiPEAC'10 Proceedings of the 5th international conference on High Performance Embedded Architectures and Compilers
Is reuse distance applicable to data locality analysis on chip multiprocessors?
CC'10/ETAPS'10 Proceedings of the 19th joint European conference on Theory and Practice of Software, international conference on Compiler Construction
Phase-based tuning for better utilization of performance-asymmetric multicore processors
CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
Share memory aware scheduler: balancing performance and fairness
Proceedings of the great lakes symposium on VLSI
Probabilistic modeling for job symbiosis scheduling on SMT processors
ACM Transactions on Architecture and Code Optimization (TACO)
Reuse distance based performance modeling and workload mapping
Proceedings of the 9th conference on Computing Frontiers
Matching memory access patterns and data placement for NUMA systems
Proceedings of the Tenth International Symposium on Code Generation and Optimization
Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing
A template library to integrate thread scheduling and locality management for NUMA multiprocessors
HotPar'12 Proceedings of the 4th USENIX conference on Hot Topics in Parallelism
MemProf: a memory profiler for NUMA multicore systems
USENIX ATC'12 Proceedings of the 2012 USENIX conference on Annual Technical Conference
Survey of scheduling techniques for addressing shared resources in multicore processors
ACM Computing Surveys (CSUR)
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
ADAPT: A framework for coscheduling multithreaded programs
ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Tiled-MapReduce: Efficient and Flexible MapReduce Processing on Multicore with Tiling
ACM Transactions on Architecture and Code Optimization (TACO)
Automatic generation of program affinity policies using machine learning
CC'13 Proceedings of the 22nd international conference on Compiler Construction
Traffic management: a holistic approach to memory placement on NUMA systems
Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Dynamic threshold for imbalance assessment on load balancing for multicore systems
Computers and Electrical Engineering
SMT-centric power-aware thread placement in chip multiprocessors
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Model-based cache-aware dispatching of object-oriented software for multicore systems
Journal of Systems and Software
Dynamic thread pinning for phase-based OpenMP programs
Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing
Crank it up or dial it down: coordinated multiprocessor frequency and folding control
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
On modeling contention for shared caches in multi-core processors with techniques from ecology
Natural Computing: an international journal
Virtual Machine Coscheduling: A Game Theoretic Approach
UCC '13 Proceedings of the 2013 IEEE/ACM 6th International Conference on Utility and Cloud Computing
Hi-index | 0.02 |
The major chip manufacturers have all introduced chip multiprocessing (CMP) and simultaneous multithreading (SMT) technology into their processing units. As a result, even low-end computing systems and game consoles have become shared memory multiprocessors with L1 and L2 cache sharing within a chip. Mid- and large-scale systems will have multiple processing chips and hence consist of an SMP-CMP-SMT configuration with non-uniform data sharing overheads. Current operating system schedulers are not aware of these new cache organizations, and as a result, distribute threads across processors in a way that causes many unnecessary, long-latency cross-chip cache accesses. In this paper we describe the design and implementation of a scheme to schedule threads based on sharing patterns detected online using features of standard performance monitoring units (PMUs) available in today's processing units. The primary advantage of using the PMU infrastructure is that it is fine-grained (down to the cache line) and has relatively low overhead. We have implemented our scheme in Linux running on an 8-way Power5 SMP-CMP-SMT multi-processor. For commercial multithreaded server workloads (VolanoMark, SPECjbb, and RUBiS), we are able to demonstrate reductions in cross-chip cache accesses of up to 70%. These reductions lead to application-reported performance improvements of up to 7%.