Adaptive cache coherency for detecting migratory shared data
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Optimizing dynamically-dispatched calls with run-time type feedback
PLDI '94 Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation
Dynamic feedback: an effective technique for adaptive computing
Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
Eliminating synchronization bottlenecks in object-based programs using adaptive replication
ICS '99 Proceedings of the 13th international conference on Supercomputing
Parallel programming in OpenMP
High-level adaptive program optimization with ADAPT
PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
Parallel Programming with Polaris
Computer
Reducing Parallel Overheads Through Dynamic Serialization
IPPS '99/SPDP '99 Proceedings of the 13th International Symposium on Parallel Processing and the 10th Symposium on Parallel and Distributed Processing
Adaptive loop transformations for scientific programs
SPDP '95 Proceedings of the 7th IEEE Symposium on Parallel and Distributed Processing
ADAPT: Automated De-Coupled Adaptive Program Transformation
ICPP '00 Proceedings of the 2000 International Conference on Parallel Processing
Automatically Mapping Code on an Intelligent Memory Architecture
HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Improving the Effectiveness of Software Prefetching with Adaptive Execution
PACT '96 Proceedings of the 1996 Conference on Parallel Architectures and Compilation Techniques
Compilation techniques for explicitly parallel programs
IBM Systems Journal
Adaptively increasing performance and scalability of automatically parallelized programs
LCPC'02 Proceedings of the 15th international conference on Languages and Compilers for Parallel Computing
Time vs. space adaptation with ATOP-grid
Proceedings of the 5th workshop on Adaptive and reflective middleware (ARM '06)
Online power-performance adaptation of multithreaded programs using hardware event-based prediction
Proceedings of the 20th annual international conference on Supercomputing
Time and space adaptation for computational grids with the ATOP-Grid middleware
Future Generation Computer Systems
Adapting application execution in CMPs using helper threads
Journal of Parallel and Distributed Computing
Adaptive execution techniques of parallel programs for multiprocessors
Journal of Parallel and Distributed Computing
Thread tailor: dynamically weaving threads together for efficient, adaptive parallel applications
Proceedings of the 37th annual international symposium on Computer architecture
A workload-aware mapping approach for data-parallel programs
Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
When less is more (LIMO): controlled parallelism for improved efficiency
Proceedings of the 2012 international conference on Compilers, architectures and synthesis for embedded systems
Adaptive parallelism for web search
Proceedings of the 8th ACM European Conference on Computer Systems
Multiverse: efficiently supporting distributed high-level speculation
Proceedings of the 2013 ACM SIGPLAN international conference on Object oriented programming systems languages & applications
In simultaneous multithreading (SMT) multiprocessors, using all the available threads (logical processors) to run a parallel loop is not always beneficial, because of interference between threads and parallel execution overhead. To maximize performance on an SMT multiprocessor, it is therefore important to find the optimal number of threads. This paper presents adaptive execution techniques that find the optimal execution mode for SMT multiprocessor architectures: a compiler preprocessor generates code that, guided by dynamic feedback, automatically determines at run time the optimal number of threads for each parallel loop in the application. On 10 standard numerical applications run on an Intel 4-processor Hyper-Threading Xeon SMP with 8 logical processors, our code is, on average, about 2 and 18 times faster than the original code executed on 4 and 8 logical processors, respectively.