The program dependence graph and its use in optimization
ACM Transactions on Programming Languages and Systems (TOPLAS)
A VLIW architecture for a trace scheduling compiler
ASPLOS II Proceedings of the second international conference on Architectual support for programming languages and operating systems
Software pipelining: an effective scheduling technique for VLIW machines
PLDI '88 Proceedings of the ACM SIGPLAN 1988 conference on Programming Language design and Implementation
Data speculation support for a chip multiprocessor
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Proceedings of the 36th annual ACM/IEEE Design Automation Conference
A general compiler framework for speculative multithreading
Proceedings of the fourteenth annual ACM symposium on Parallel algorithms and architectures
Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture
Proceedings of the 30th annual international symposium on Computer architecture
Min-cut program decomposition for thread-level speculation
Proceedings of the ACM SIGPLAN 2004 conference on Programming language design and implementation
Decoupled Software Pipelining with the Synchronization Array
Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Exposing speculative thread parallelism in SPEC2000
Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming
Automatic Thread Extraction with Decoupled Software Pipelining
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Chip multi-processor scalability for single-threaded applications
ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
Hardware-modulated parallelism in chip multiprocessors
ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
Exploiting coarse-grained task, data, and pipeline parallelism in stream programs
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Automatic mapping of nested loops to FPGAS
Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming
Carbon: architectural support for fine-grained parallelism on chip multiprocessors
Proceedings of the 34th annual international symposium on Computer architecture
Global Multi-Threaded Instruction Scheduling
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
A Practical Approach to Exploiting Coarse-Grained Pipeline Parallelism in C Programs
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Data Access Partitioning for Fine-grain Parallelism on Multicore Architectures
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Hi-index | 0.00 |
CMP brings more benefits comparing with uni-core processor, but CMP is not fit for legacy code well because legacy code bases on uni-core processor. This paper presents a novel Thread Level Parallelism technology called Dataflow Abstracting Thread (DFAT). DFAT builds a United Dependence Graph (UDG) for the program and decouples single thread into many threads which can run on CMP parallelly. DFAT analyzes the program's data-, control- and anti-dependence and gets a dependence graph, then dependences are combined and be added some attributes to get a UDG. The UDG decides instructions execution order, and according to this, instructions can be assigned to different thread one by one. An algorithm decides how to assign those instructions. DFAT considers both communication overhead and thread balance after the original thread division. Thread communication in DFAT is implemented by producer-consumer model. DFAT can automatically abstract multi-thread from single thread and be implemented in compiler. In our evaluation, we decouple single thread into at most 8 threads with DFAT and the result shows that decoupling single thread into 4-6 threads can get best benefits.