The impact of synchronization and granularity on parallel systems

Authors:
Ding-Kai Chen; Hong-Men Su; Pen-Chung Yew
Affiliations:
Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign, Urbana, Illinois (all authors)
Venue:
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Year:
1990

Abstract

In this paper, we study the impact of synchronization and granularity on the performance of parallel systems using an execution-driven simulation technique. We find that even though abundant parallelism can exist at the fine-grain level, synchronization and scheduling strategies determine the ultimate performance of the system. Loop-iteration-level parallelism appears to be a more appropriate granularity when those factors are considered. We also study barrier synchronization and data synchronization at the loop-iteration level and find that both schemes are needed for better performance.