Shared Register File Based ILP for Multicore

Authors:
Lihan Ju; Wei Hu; Lingxiang Xiang; Tianzhou Chen
Venue:
GREENCOM-CPSCOM '10 Proceedings of the 2010 IEEE/ACM Int'l Conference on Green Computing and Communications & Int'l Conference on Cyber, Physical and Social Computing
Year:
2010

Abstract

As semiconductor technology advances, more transistors can be integrated onto a single chip. However, current software programming models do not match the parallelism that CMP (Chip Multi-Processor) architectures require. Communication between cores becomes a serious bottleneck and degrades performance. This paper proposes an approach called API (Architecture of Parallelism on Instructions), which scans a program's source code, analyzes its data dependencies, and clusters related instructions together. Instructions with no dependencies between them can be issued directly in parallel by different cores. API provides a global register file so that programs execute efficiently on CMP chips. We compared the execution time of API against a traditional architecture using the SPEC CPU2000 benchmark. The experimental results show that the instruction clock count under API is only 49 percent of the original, and that only 4 cores are needed to approach the best performance.
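
The dependency analysis and clustering step described in the abstract can be illustrated with a small sketch. This is not the paper's algorithm, which the abstract does not specify. It is a minimal illustration that assumes a hypothetical three-address instruction format (`Instr`), treats register names as global (loosely mirroring the shared register file), and groups instructions linked by read-after-write, write-after-write, or write-after-read dependencies. The names `Instr`, `cluster_by_dependency`, and `assign_to_cores` are invented for this example.

```python
from collections import namedtuple

# Hypothetical instruction format (an assumption, not from the paper):
# one destination register and a tuple of source registers.
Instr = namedtuple("Instr", ["dest", "srcs"])


def cluster_by_dependency(instrs):
    """Return clusters of instruction indices linked by register dependencies."""
    parent = list(range(len(instrs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a, b):
        parent[find(a)] = find(b)

    last_writer = {}  # register -> index of the last instruction that wrote it
    readers = {}      # register -> indices that read it since its last write

    for i, ins in enumerate(instrs):
        for reg in ins.srcs:
            if reg in last_writer:           # read-after-write
                union(i, last_writer[reg])
            readers.setdefault(reg, []).append(i)
        if ins.dest in last_writer:          # write-after-write
            union(i, last_writer[ins.dest])
        for r in readers.pop(ins.dest, []):  # write-after-read
            union(i, r)
        last_writer[ins.dest] = i

    clusters = {}
    for i in range(len(instrs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


def assign_to_cores(clusters, n_cores):
    """Distribute whole clusters over cores round-robin (a simple assumed policy)."""
    cores = [[] for _ in range(n_cores)]
    for k, cluster in enumerate(clusters):
        cores[k % n_cores].extend(cluster)
    return cores


program = [
    Instr("r1", ("r0", "r0")),  # 0: r1 = r0 + r0
    Instr("r2", ("r1", "r1")),  # 1: r2 = r1 * r1  (depends on 0)
    Instr("r4", ("r3", "r3")),  # 2: r4 = r3 + r3  (independent of 0 and 1)
    Instr("r5", ("r4", "r3")),  # 3: r5 = r4 - r3  (depends on 2)
]
clusters = cluster_by_dependency(program)
schedule = assign_to_cores(clusters, 2)
```

In this toy program, instructions 0 and 1 form one cluster and instructions 2 and 3 form another, so each cluster can go to its own core. Two instructions that only read the same register are not treated as dependent, which is why the shared reads of `r3` alone would not merge clusters.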