This paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlled shared-memory machines. We present lower and upper bounds on the number of cache misses incurred under work stealing, and introduce a locality-guided work-stealing algorithm along with experimental validation.

As a lower bound, we show that there is a family of multithreaded computations Gn, each member of which requires Θ(n) total instructions (work), for which the number of cache misses under work stealing is constant on one processor but Ω(n) on two processors. This implies that for general computations there is no useful bound relating multiprocessor to uniprocessor cache misses. For nested-parallel computations, however, we show that on P processors the expected number of cache misses beyond those incurred on a single processor is bounded by O(C⌈m/s⌉PT∞), where C is the size of the cache, m is the execution time of an instruction incurring a cache miss, s is the steal time, and T∞ is the number of nodes on the longest chain of dependences. Based on this result, we give strong bounds on the total running time of nested-parallel computations under work stealing.

For the second part of our results, we present a locality-guided work-stealing algorithm that improves the data locality of multithreaded computations by allowing a thread to have an affinity for a processor. Our initial experiments on iterative data-parallel applications show that the algorithm matches the performance of static partitioning under traditional workloads, but improves performance by up to 50% over static partitioning under multiprogrammed workloads. Furthermore, locality-guided work stealing improves the performance of plain work stealing by up to 80%.
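To make the affinity idea concrete, below is a minimal, single-threaded simulation sketch of locality-guided work stealing. It is not the paper's implementation: the class names, the round-robin scheduling loop, and the duplicate-and-skip handling of mailed tasks are illustrative assumptions. The point it demonstrates is the priority order a worker follows: first its mailbox of affinity tasks, then its own deque, and only then a steal from a random victim.

```python
import collections
import random

class Worker:
    def __init__(self, wid):
        self.id = wid
        self.deque = collections.deque()    # local task deque (LIFO end for owner)
        self.mailbox = collections.deque()  # tasks mailed here by affinity

def schedule(tasks, num_workers, seed=0):
    """Round-robin simulation of locality-guided work stealing.

    `tasks` is a list of (name, affinity) pairs, where affinity is a
    worker id or None. Every task starts on worker 0's deque; a task
    with an affinity is additionally mailed to its preferred worker.
    Since a mailed task thus appears in two places, workers skip any
    task that has already run. Returns the (worker_id, task) trace.
    """
    rng = random.Random(seed)
    workers = [Worker(i) for i in range(num_workers)]
    done = set()
    for name, aff in tasks:
        workers[0].deque.append(name)           # all work starts on worker 0
        if aff is not None:
            workers[aff].mailbox.append(name)   # affinity hint (duplicate entry)
    trace = []
    while len(done) < len(tasks):
        for w in workers:
            task = None
            while w.mailbox and task is None:   # 1. prefer mailed affinity tasks
                t = w.mailbox.popleft()
                task = t if t not in done else None
            while task is None and w.deque:     # 2. then the owner's deque (bottom)
                t = w.deque.pop()
                task = t if t not in done else None
            if task is None:                    # 3. otherwise steal from the top
                victim = workers[rng.randrange(num_workers)]
                if victim.deque:
                    t = victim.deque.popleft()
                    if t not in done:
                        task = t
            if task is not None:
                done.add(task)
                trace.append((w.id, task))
    return trace
```

Checking the mailbox before the deque is what biases each task toward the processor whose cache likely still holds its data, while step 3 preserves work stealing's load-balancing guarantee when a worker runs dry.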