Maximum matching on bipartite graphs can provide optimal scheduling for crossbar-based interconnection networks. Unfortunately, maximum matching requires O(N^3) time for an N × N communication system, which has limited its application to real-time network scheduling. In this paper, we show how maximum matching can be reformulated in terms of Boolean operations rather than the more traditional formulations. By taking advantage of the inherent parallelism available in custom hardware design, we introduce three maximum matching implementations in hardware and show how design complexity can be traded for performance. Specifically, we examine a Pure Logic Scheduler with three dimensions of parallelism, a Matrix Scheduler with two dimensions of parallelism, and a Vector Scheduler with one dimension of parallelism. These designs reduce the algorithmic time complexity to O(1), O(K), and O(KN), respectively, where K is the number of optimization steps. While an optimal scheduling algorithm requires K = 2N - 1 steps, our simulation results show that the scheduler can achieve 99% of the optimal schedule when K = 9. We examine the hardware and time complexity of these architectures for crossbar sizes of up to N = 1024. Using FPGA synthesis results, we show that a greedy schedule for crossbars ranging from 8 × 8 to 256 × 256 can be optimized in less than 20 ns per optimization step. For crossbars reaching 1024 × 1024, scheduling can be completed in approximately 10 μs with current technology and could reach under 90 ns with future technologies.
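As a rough software illustration of the idea of starting from a greedy schedule and then improving it in a bounded number of steps, the sketch below runs a greedy pass over a Boolean request matrix and then applies textbook augmenting-path search. It is not the paper's hardware formulation: the names `greedy_match`, `optimize`, and `max_steps` are hypothetical, and `max_steps` here caps the number of augmentation attempts, which is only a stand-in for the paper's K and may not correspond to it.

```python
from typing import List, Optional

Matrix = List[List[bool]]        # requests[i][j]: input i wants output j
Schedule = List[Optional[int]]   # schedule[i]: output granted to input i


def greedy_match(requests: Matrix) -> Schedule:
    """Grant each input the first requested output that is still free."""
    n = len(requests)
    out_taken = [False] * n
    schedule: Schedule = [None] * n
    for i in range(n):
        for j in range(n):
            if requests[i][j] and not out_taken[j]:
                schedule[i] = j
                out_taken[j] = True
                break
    return schedule


def _augment(requests: Matrix, i: int, seen: List[bool],
             schedule: Schedule, owner: Schedule) -> bool:
    """Depth-first search for an augmenting path starting at input i."""
    for j in range(len(requests)):
        if requests[i][j] and not seen[j]:
            seen[j] = True
            # Output j is free, or its current owner can be moved elsewhere.
            if owner[j] is None or _augment(requests, owner[j], seen,
                                            schedule, owner):
                schedule[i] = j
                owner[j] = i
                return True
    return False


def optimize(requests: Matrix, schedule: Schedule, max_steps: int) -> Schedule:
    """Improve a schedule with at most max_steps augmentation attempts.

    Assumption: max_steps is an illustrative bound, not the paper's K.
    """
    n = len(requests)
    schedule = list(schedule)
    owner: Schedule = [None] * n  # owner[j]: input currently holding output j
    for i, j in enumerate(schedule):
        if j is not None:
            owner[j] = i
    steps = 0
    for i in range(n):
        if steps >= max_steps:
            break
        if schedule[i] is None:
            steps += 1
            _augment(requests, i, [False] * n, schedule, owner)
    return schedule


if __name__ == "__main__":
    R = [[True, True, False],
         [True, False, False],
         [False, False, True]]
    g = greedy_match(R)
    print(g, optimize(R, g, max_steps=3))
```

In this 3 × 3 example the greedy pass leaves input 1 unmatched because input 0 has already taken output 0; a single augmentation moves input 0 to output 1 and frees output 0 for input 1, giving a full matching. Setting `max_steps=0` returns the greedy schedule unchanged.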