This paper presents asymptotically equal lower and upper bounds for the number of parallel I/O operations required to perform bit-matrix-multiply/complement (BMMC) permutations on the Parallel Disk Model proposed by Vitter and Shriver. A BMMC permutation maps a source index to a target index by an affine transformation over GF(2), where the source and target indices are treated as bit vectors. The class of BMMC permutations includes many common permutations, such as matrix transposition (when dimensions are powers of 2), bit-reversal permutations, vector-reversal permutations, hypercube permutations, matrix reblocking, Gray-code permutations, and inverse Gray-code permutations. The upper bound improves upon the asymptotic bound in the previous best known BMMC algorithm and upon the constant factor in the previous best known bit-permute/complement (BPC) permutation algorithm. The algorithm achieving the upper bound uses basic linear-algebra techniques to factor the characteristic matrix for the BMMC permutation into a product of factors, each of which characterizes a permutation that can be performed in one pass over the data.

The factoring uses new subclasses of BMMC permutations: memoryload-dispersal (MLD) permutations and their inverses. These subclasses extend the catalog of one-pass permutations.

Although many BMMC permutations of practical interest fall into subclasses that might be explicitly invoked within the source code, this paper shows how to quickly detect whether a given vector of target addresses specifies a BMMC permutation. Thus, one can determine efficiently at run time whether a permutation to be performed is BMMC and then avoid the general-permutation algorithm and save parallel I/Os by using the BMMC permutation algorithm herein.
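To make the definition concrete, the following Python sketch applies a BMMC mapping y = Ax XOR c over GF(2) and checks whether a vector of target addresses is consistent with such a mapping. It is a naive in-memory illustration only, not the I/O-efficient detection procedure of the paper. The function names (`bmmc_apply`, `gf2_rank`, `detect_bmmc`) and the representation of the characteristic matrix as a list of integer-packed columns are assumptions made for this example.

```python
def bmmc_apply(cols, c, x):
    """Return A*x XOR c over GF(2).

    cols[j] is column j of the characteristic matrix A, packed into an
    integer (bit i of cols[j] is A[i][j]); c is the complement vector.
    This packing is an assumption of the sketch, not the paper's notation.
    """
    y = c
    j = 0
    while x:
        if x & 1:          # bit j of the source index selects column j
            y ^= cols[j]
        x >>= 1
        j += 1
    return y


def gf2_rank(vectors):
    """Rank over GF(2) of integer-packed bit vectors (Gaussian elimination)."""
    pivots = {}            # highest set bit -> vector with that leading bit
    for v in vectors:
        while v:
            h = v.bit_length() - 1
            if h not in pivots:
                pivots[h] = v
                break
            v ^= pivots[h]  # cancel the leading bit and keep reducing
    return len(pivots)


def detect_bmmc(targets):
    """Return (cols, c) if targets[x] == A*x XOR c for a nonsingular A,
    otherwise None. Naive check that reads every target address."""
    size = len(targets)
    if size == 0 or size & (size - 1):
        return None        # length must be a power of 2
    n = size.bit_length() - 1
    c = targets[0]         # A*0 = 0, so the image of index 0 is c
    # The image of the unit vector e_j is column j XOR c.
    cols = [targets[1 << j] ^ c for j in range(n)]
    if gf2_rank(cols) != n:
        return None        # singular matrix: not a permutation
    for x, t in enumerate(targets):
        if bmmc_apply(cols, c, x) != t:
            return None
    return cols, c


# Bit reversal on 3-bit indices is BMMC with complement vector 0.
print(detect_bmmc([0, 4, 2, 6, 1, 5, 3, 7]))
# Vector reversal (x -> 7 - x) is the identity matrix with c = 7.
print(detect_bmmc([7, 6, 5, 4, 3, 2, 1, 0]))
# Swapping only the last two addresses is not affine over GF(2).
print(detect_bmmc([0, 1, 2, 3, 4, 5, 7, 6]))
```

The sketch recovers the candidate matrix from the images of index 0 and of the n unit vectors, then verifies every remaining address. That verification step touches all N targets, which is the cost a run-time detection method would aim to reduce.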