URPR—An extension of URCR for software pipelining
MICRO 19 Proceedings of the 19th annual workshop on Microprogramming
Software pipelining: an effective scheduling technique for VLIW machines
PLDI '88 Proceedings of the ACM SIGPLAN 1988 conference on Programming language design and implementation
The performance implications of thread management alternatives for shared-memory multiprocessors
SIGMETRICS '89 Proceedings of the 1989 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Introduction to algorithms
An efficient algorithm for a task allocation problem
Journal of the ACM (JACM)
Instruction-level parallel processing: history, overview, and perspective
The Journal of Supercomputing - Special issue on instruction-level parallelism
VLIW compilation techniques in a superscalar environment
PLDI '94 Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation
The definition of dependence distance
ACM Transactions on Programming Languages and Systems (TOPLAS)
A compilation technique for software pipelining of loops with conditional jumps
MICRO 20 Proceedings of the 20th annual workshop on Microprogramming
Performance counters and state sharing annotations: a unified approach to thread locality
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Scheduling threads for low space requirement and good locality
Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
Dependence Analysis
Optimizing Supercompilers for Supercomputers
Conversion of control dependence to data dependence
POPL '83 Proceedings of the 10th ACM SIGACT-SIGPLAN symposium on Principles of programming languages
Structure of Computers and Computations
Making Compaction-Based Parallelization Affordable
IEEE Transactions on Parallel and Distributed Systems
MICRO 14 Proceedings of the 14th annual workshop on Microprogramming
Task allocation in distributed systems: A survey of practical strategies
ACM '82 Proceedings of the ACM '82 conference
Compile-time scheduling and optimization for asynchronous machines (multiprocessor, compiler, parallel processing)
Parallelism, memory anti-aliasing and correctness for trace scheduling compilers (disambiguation, flow-analysis, compaction)
Compaction-based parallelization
IEEE Transactions on Computers
Dual Processor Scheduling with Dynamic Reassignment
IEEE Transactions on Software Engineering
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
On the evaluation and extraction of thread-level parallelism in ordinary programs
Intel threading building blocks
Compiler-Driven Dependence Profiling to Guide Program Parallelization
Languages and Compilers for Parallel Computing
Techniques for efficient placement of synchronization primitives
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Synchronization optimizations for efficient execution on multi-cores
Proceedings of the 23rd international conference on Supercomputing
Thread-level program parallelization is key to exploiting the hardware parallelism of emerging multi-core systems. Several techniques have been proposed for program multithreading. However, existing techniques do not address the following key issues associated with multithreaded execution of a given program: (a) whether multithreaded execution is faster than sequential execution; and (b) how many threads to spawn during program multithreading. In this paper, we address these limitations. Specifically, we propose a novel approach, T-OPT, to determine how many threads to spawn during multithreaded execution of a given program region. This helps avoid both under-subscription and over-subscription of hardware resources, which in turn facilitates exploitation of a higher level of thread-level parallelism (TLP) than can be achieved using the state of the art. We show that, from a program-dependence standpoint, using more threads than the proposed approach advocates does not yield a higher degree of TLP. We present a couple of case studies and results on kernels, extracted from open-source codes, to demonstrate the efficacy of our techniques on a real machine.
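The abstract's core idea is that spawning more threads than either the program's dependence-limited parallelism or the hardware can sustain yields no additional TLP. The paper's T-OPT analysis is not reproduced here; the following is only a minimal sketch of that capping idea, assuming a dependence-limited TLP estimate for the region is supplied by some external analysis (the function name and parameters are illustrative, not from the paper):

```python
import os

def choose_thread_count(dependence_limited_tlp, hw_threads=None):
    """Pick a thread count for a program region.

    dependence_limited_tlp: estimated maximum useful parallelism of the
        region (e.g., from a dependence analysis) -- a hypothetical input.
    hw_threads: hardware thread count; defaults to the machine's.

    Spawning beyond min(TLP, hardware threads) over-subscribes resources
    without increasing exploited parallelism; spawning fewer under-subscribes.
    """
    if hw_threads is None:
        hw_threads = os.cpu_count() or 1
    # Never spawn fewer than one thread, never more than either limit.
    return max(1, min(dependence_limited_tlp, hw_threads))
```

For example, a region whose dependences admit 16-way parallelism on an 8-thread machine would be capped at 8 threads, while a region with only 3-way parallelism would get 3 threads even on that same machine.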