We study the feasibility and efficiency of two new parallel algorithms that sample random permutations of the integers [M] = {1, ..., M}. The first reduces the communication volume for p processors from O(M) words (O(M log M) bits, the coding size of the permutation) to O(M log p / log M) words (O(M log p) bits, the coding size of a partition of [M] into subsets of size M/p). The second exploits the common practice of using pseudorandom numbers instead of true randomness; it reduces the communication even further, to a bandwidth proportional to the amount of true randomness actually consumed. Careful engineering of the required subroutines is necessary to obtain a competitive implementation. The second approach in particular shows very good results, as demonstrated by large-scale experiments: it achieves high scalability and outperforms previously known approaches by a wide margin. First, we compare our algorithm to the classical sequential data-shuffle algorithm, obtaining a speedup of about 1.5. We then show that the algorithm parallelizes well on a multicore system and scales to a cluster with 440 cores.
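The sequential baseline mentioned above is, by standard convention, the in-place data-shuffle (Fisher–Yates/Knuth) algorithm, which produces a uniformly random permutation with one random draw per element. A minimal sketch in Python (function name and seeding interface are illustrative, not taken from the paper):

```python
import random

def shuffle_permutation(m, seed=None):
    """Sample a uniformly random permutation of [m] = {1, ..., m}
    using the classical in-place data-shuffle (Fisher-Yates) algorithm."""
    rng = random.Random(seed)
    perm = list(range(1, m + 1))
    # Walk from the last position down; swap each element with a
    # uniformly chosen element at or before its position.
    for i in range(m - 1, 0, -1):
        j = rng.randint(0, i)  # uniform in {0, ..., i}
        perm[i], perm[j] = perm[j], perm[i]
    return perm
```

Each of the m - 1 iterations consumes one uniform random index, so the total work and the randomness used are both O(m); the parallel algorithms in the paper aim to beat the O(M)-word communication cost that a naive distribution of this procedure would incur.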