Performance evaluation of Intel® transactional synchronization extensions for high-performance computing

Authors:
Richard M. Yoo; Christopher J. Hughes; Konrad Lai; Ravi Rajwar
Affiliations:
Parallel Computing Laboratory, Intel Labs, Santa Clara, CA; Parallel Computing Laboratory, Intel Labs, Santa Clara, CA; Intel Architecture Development Group, Intel Architecture Group, Hillsboro, OR; Intel Architecture Development Group, Intel Architecture Group, Hillsboro, OR
Venue:
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Year:
2013

Abstract

Intel has recently introduced Intel® Transactional Synchronization Extensions (Intel® TSX) in the Intel 4th Generation Core™ Processors. With Intel TSX, a processor can dynamically determine whether threads need to serialize through lock-protected critical sections. In this paper, we evaluate the first hardware implementation of Intel TSX using a set of high-performance computing (HPC) workloads, and demonstrate that applying Intel TSX to these workloads can provide significant performance improvements. On a set of real-world HPC workloads, applying Intel TSX provides an average speedup of 1.41x. When applied to a parallel user-level TCP/IP stack, Intel TSX provides an average bandwidth improvement of 1.31x on network-intensive applications. We also demonstrate the ease with which we were able to apply Intel TSX to the various workloads.
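
To make the idea of dynamically deciding whether threads need to serialize more concrete, the following is a minimal software model of the decision and is not taken from the paper. It does not use Intel TSX itself: it only checks whether two critical sections guarded by the same lock touch conflicting data, which is the condition under which speculative execution would have to fall back to serialization. The names `CriticalSection`, `lines`, and `must_serialize` are hypothetical, and the 64-byte cache-line granularity is an assumption of this sketch. Real hardware can also abort speculation for other reasons (for example capacity limits or interrupts), which the model ignores.

```python
from dataclasses import dataclass

CACHE_LINE_BYTES = 64  # assumed conflict-detection granularity for this model


@dataclass(frozen=True)
class CriticalSection:
    """Byte addresses one thread touches while inside a lock-protected section."""
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()


def lines(addresses):
    """Map byte addresses to the indices of the cache lines that contain them."""
    return {addr // CACHE_LINE_BYTES for addr in addresses}


def must_serialize(a, b):
    """Return True if the sections conflict: one writes a line the other touches."""
    a_reads, a_writes = lines(a.reads), lines(a.writes)
    b_reads, b_writes = lines(b.reads), lines(b.writes)
    return bool(a_writes & (b_reads | b_writes)) or bool(b_writes & a_reads)


# Two threads update different hash buckets under the same global lock.
t1 = CriticalSection(reads=frozenset({0x1000}), writes=frozenset({0x1000}))
t2 = CriticalSection(reads=frozenset({0x2000}), writes=frozenset({0x2000}))
# A third thread writes a field that shares a cache line with t1's bucket.
t3 = CriticalSection(writes=frozenset({0x1008}))

print(must_serialize(t1, t2))  # False: disjoint data, the lock could be elided
print(must_serialize(t1, t3))  # True: same cache line, execution must serialize
```

In this model, `t1` and `t2` could run concurrently even though they nominally take the same lock, while `t1` and `t3` conflict because their addresses fall on the same assumed cache line, even though the bytes differ.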