Merging, sorting and matrix operations on the SOME-bus multiprocessor architecture

Authors:
Constantine Katsinis
Affiliations:
Electrical and Computer Engineering, Drexel University, Philadelphia, PA
Venue:
Future Generation Computer Systems - Special issue: Advanced services for clusters and internet computing
Year:
2004

Abstract

Advances in fiber-optic and VLSI technology are making feasible interconnection networks that allow multiple simultaneous broadcasts. This paper presents the multiprocessor architecture of the Simultaneous Optical Multiprocessor Exchange Bus (SOME-Bus) and examines the performance of representative algorithms for matrix operations, merging, and sorting under the message-passing and distributed-shared-memory (DSM) paradigms. It shows that simple enhancements to the network interface and to the cache and directory controllers can yield O(1) communication time for the matrix-vector multiplication algorithm using DSM. The SOME-Bus is a low-latency, high-bandwidth, fiber-optic interconnection network that directly links arbitrary pairs of processor nodes without contention and can efficiently interconnect over 100 nodes. It contains a dedicated channel for the data output of each node, which eliminates the need for global arbitration and provides bandwidth that scales directly with the number of nodes in the system. Each of the P nodes has an array of receivers, with one receiver dedicated to each node output channel. No node is ever blocked from transmitting by another transmitter or by contention for shared switching logic. The entire P-receiver array can be integrated on a single chip at comparatively minor cost, resulting in O(P) complexity. The SOME-Bus offers more functionality than a crossbar because it supports multiple simultaneous broadcasts of messages, which allows cache consistency protocols to complete much faster.
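
The channel-per-node organization described in the abstract can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not the algorithm evaluated in the paper: it assumes that node i holds row i of a P x P matrix and element i of the input vector, and that one "communication step" is a single round in which every node broadcasts on its own dedicated channel while every other node listens on its receiver array. The names `Node`, `broadcast_round`, and `matvec` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical model of one SOME-Bus node (illustrative only)."""
    node_id: int
    row: list                 # assumed data layout: row node_id of the matrix
    x_local: float            # assumed data layout: element node_id of the vector
    receivers: list = field(default_factory=list)  # one slot per output channel


def broadcast_round(nodes):
    """All nodes transmit at once, each on its own dedicated channel.

    Because no channel is shared, no sender waits for another, so the
    round counts as one communication step regardless of the node count.
    """
    channels = [n.x_local for n in nodes]   # channel k carries node k's output
    for n in nodes:
        n.receivers = list(channels)        # receiver k listens to channel k
    return 1                                # number of communication steps used


def matvec(matrix, vector):
    """Compute matrix * vector with one simulated broadcast round."""
    nodes = [Node(i, list(matrix[i]), vector[i]) for i in range(len(vector))]
    steps = broadcast_round(nodes)
    # Purely local computation: each node forms one element of the result.
    result = [sum(a * x for a, x in zip(n.row, n.receivers)) for n in nodes]
    return result, steps


if __name__ == "__main__":
    y, steps = matvec([[1, 2], [3, 4]], [5, 6])
    print(y, steps)
```

In this model the step count returned by `broadcast_round` does not depend on the number of nodes, which mirrors the O(1) communication time the abstract reports for matrix-vector multiplication. The sketch does not model the DSM cache and directory enhancements mentioned in the abstract, nor message lengths or receiver queueing.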