Whole-program optimization for time and space efficient threads

  • Authors:
  • Dirk Grunwald; Richard Neves

  • Affiliations:
  • Dept. of Computer Science, University of Colorado, Campus Box 430, Boulder, CO; IBM T.J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY

  • Venue:
  • Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VII)
  • Year:
  • 1996


Abstract

Modern languages and operating systems often encourage programmers to use threads, or independent control streams, to mask the overhead of some operations and to simplify program structure. Multitasking operating systems use threads to mask communication latency, either with hardware devices or users. Client-server applications typically use threads to simplify the complex control flow that arises when serving multiple clients. Recently, the scientific computing community has started using threads to mask network communication latency in massively parallel architectures, allowing computation and communication to be overlapped. Lastly, some architectures implement threads in hardware, using those threads to tolerate memory latency.

In general, it would be desirable if threaded programs could be written to expose the largest degree of parallelism possible, or to simplify the program design. However, threads incur time and space overheads, and programmers often compromise simple designs for performance. In this paper, we show how to reduce the time and space overhead of threads using control flow and register liveness information inferred after compilation. Our techniques work on binaries, are not specific to a particular compiler or thread library, and reduce the overall execution time of fine-grain threaded programs by ≈ 15-30%. We use execution-driven analysis and an instrumented operating system to show why execution time is reduced and to indicate areas for future work.
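
To illustrate the core idea (this sketch is not from the paper itself): a context switch need only save the registers that a post-compilation liveness analysis marks live at that switch point, rather than the full register file. The C sketch below simulates the register file with an array; the live_mask encoding, function names, and switch sites are hypothetical assumptions for illustration, whereas the paper's actual technique rewrites compiled binaries directly.

    /* Minimal sketch of liveness-aware context switching.
     * Instead of saving every register at a thread switch, save only
     * the registers that liveness analysis marked live at this site.
     * The register file is simulated; names here are illustrative. */
    #include <stdint.h>
    #include <inttypes.h>
    #include <stdio.h>

    #define NUM_REGS 16

    typedef struct {
        uint64_t regs[NUM_REGS];   /* saved register values */
        uint32_t live_mask;        /* bit r set => slot r holds live data */
    } thread_ctx;

    /* Save only the registers marked live at this switch site. */
    static void save_live(thread_ctx *ctx, const uint64_t cpu[NUM_REGS],
                          uint32_t live_mask) {
        ctx->live_mask = live_mask;
        for (int r = 0; r < NUM_REGS; r++)
            if (live_mask & (1u << r))
                ctx->regs[r] = cpu[r];
    }

    /* Restore only what was saved; dead registers need no reload. */
    static void restore_live(const thread_ctx *ctx, uint64_t cpu[NUM_REGS]) {
        for (int r = 0; r < NUM_REGS; r++)
            if (ctx->live_mask & (1u << r))
                cpu[r] = ctx->regs[r];
    }

    int main(void) {
        uint64_t cpu[NUM_REGS] = {0};
        thread_ctx t = {0};

        cpu[3] = 42; cpu[7] = 99;               /* pretend r3 and r7 are live */
        save_live(&t, cpu, (1u << 3) | (1u << 7));
        cpu[3] = cpu[7] = 0;                    /* another thread clobbers them */
        restore_live(&t, cpu);                  /* only r3 and r7 come back */
        printf("r3=%" PRIu64 " r7=%" PRIu64 "\n", cpu[3], cpu[7]);
        return 0;
    }

When liveness information shows few registers live at a switch point, which is common at fine-grain switch sites, the save/restore cost shrinks proportionally; this is the kind of per-site specialization the paper derives from binaries after compilation.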