HAQu: Hardware-accelerated queueing for fine-grained threading on a chip multiprocessor

Authors:
Sanghoon Lee;Devesh Tiwari;Yan Solihin;James Tuck
Affiliations:
Department of Electrical & Computer Engineering, North Carolina State University;Department of Electrical & Computer Engineering, North Carolina State University;Department of Electrical & Computer Engineering, North Carolina State University;Department of Electrical & Computer Engineering, North Carolina State University
Venue:
HPCA '11 Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture
Year:
2011

Citing 0
Cited 2

Load-balanced pipeline parallelism

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Automatic parallelization of fine-grained metafunctions on a chip multiprocessor

ACM Transactions on Architecture and Code Optimization (TACO)

Quantified Score

Hi-index	0.00

Visualization

Abstract

Queues are commonly used in multithreaded programs for synchronization and communication. However, because software queues tend to be too expensive to support finegrained parallelism, hardware queues have been proposed to reduce overhead of communication between cores. Hardware queues require modifications to the processor core and need a custom interconnect. They also pose difficulties for the operating system because their state must be preserved across context switches. To solve these problems, we propose a hardware-accelerated queue, or HAQu. HAQu adds hardware to a CMP that accelerates operations on software queues. Our design implements fast queueing through an application's address space with operations that are compatible with a fully software queue. Our design provides accelerated and OS-transparent performance in three general ways: (1) it provides a single instruction for enqueueing and dequeueing which significantly reduces the overhead when used in fine-grained threading; (2) operations on the queue are designed to leverage low-level details of the coherence protocol; and (3) hardware ensures that the full state of the queue is stored in the application's address space, thereby ensuring virtualization. We have evaluated our design in the context of application domains: offloading fine-grained checks for improved software reliability, and automatic, fine-grained parallelization using decoupled software pipelining.