Boosting single-thread performance in multi-core systems through fine-grain multi-threading

Authors:
Carlos Madriles;Pedro López;Josep M. Codina;Enric Gibert;Fernando Latorre;Alejandro Martinez;Raúl Martinez;Antonio Gonzalez
Affiliations:
Intel Barcelona Research Center, Intel Labs - Universitat Politecnica de Catalunya, Barcelona, Spain (all authors)
Venue:
Proceedings of the 36th annual international symposium on Computer architecture
Year:
2009

Abstract

Industry has shifted towards multi-core designs as we have hit the memory and power walls. However, single-thread performance remains of paramount importance, since some applications have limited thread-level parallelism (TLP), and even a small portion of code with limited TLP imposes significant constraints on overall performance, as explained by Amdahl's law. In this paper we propose a novel approach for leveraging multiple cores to improve single-thread performance in a multi-core design. The proposed technique features a set of novel hardware mechanisms that support the execution of threads generated at compile time. These threads result from a fine-grain speculative decomposition of the original application, and they are executed on a modified multi-core system that includes: (1) mechanisms to support multiple versions; (2) mechanisms to detect violations among threads; (3) mechanisms to reconstruct the original sequential order; and (4) mechanisms to checkpoint the architectural state and recover from misspeculations. The proposed scheme outperforms previous hardware-only schemes that combine cores to execute single-thread applications in a multi-core design by more than 10% on average on Spec2006 for all configurations. Moreover, single-thread performance is improved by 41% on average when the proposed scheme is used on a Tiny Core, and by up to 2.6x for some selected applications.
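
The Amdahl's law argument in the abstract can be made concrete with a small calculation. The sketch below is illustrative only: the function name `amdahl_speedup` and the 90% parallel fraction are assumptions chosen for the example, not figures from the paper.

```python
def amdahl_speedup(parallel_fraction: float, n_cores: int) -> float:
    """Overall speedup when only `parallel_fraction` of run time scales with cores.

    Amdahl's law: speedup = 1 / ((1 - p) + p / n).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if n_cores < 1:
        raise ValueError("n_cores must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)


if __name__ == "__main__":
    # Hypothetical workload: 90% of the run time parallelizes perfectly.
    for cores in (2, 4, 16, 64):
        print(f"{cores:>2} cores: {amdahl_speedup(0.90, cores):.2f}x")
```

Under this assumed 90% parallel fraction, the speedup can never reach 10x no matter how many cores are added, because the serial 10% dominates. This is why speeding up the single-thread portion matters even in a multi-core design.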