Performance evaluation of basic linear algebra subroutines on a matrix co-processor
PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
Previously, we represented the index space of the (n×n)-matrix multiply-add problem C=C+A×B as a 3D torus, where A, B, and C are rolled along the corresponding axes of the index space. All optimal 2D data allocations (resulting from projection) that solve the problem on the n×n torus array processor in n multiply-add-roll steps were obtained. In this paper, we formulate the operations needed to align both the data before computing and the results after computing as matrix multiply-add problems. These alignment operations are combined with the optimal data allocations that solve the matrix multiply-add problem to propose new algorithms that transpose an n×n matrix on the n×n torus array processor in O(n) multiply-add-roll steps. Using the proposed algorithms, we show different approaches to solving the transposed matrix multiply-add problem, C=C+A^T×B^T, on the 2D torus array processor.
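To make the idea of n multiply-add-roll steps concrete, the following is a minimal sequential sketch of one well-known scheme of this kind, a Cannon-style allocation in which C stays in place while A is rolled along rows and B along columns. It is not necessarily one of the allocations derived in the paper, and the function name `torus_multiply_add` and the list-of-lists representation are assumptions made for illustration only. The initial skew plays the role of the data alignment that the abstract refers to.

```python
def torus_multiply_add(A, B, C):
    """Simulate C = C + A x B on an n x n torus in n multiply-add-roll steps.

    Illustrative Cannon-style sketch (an assumption, not the paper's method):
    cell (i, j) of the torus holds one element each of A, B, and C.
    """
    n = len(A)
    # Alignment: roll row i of A left by i, and column j of B up by j,
    # so that cell (i, j) starts with A[i][k] and B[k][j] for k = (i + j) mod n.
    a = [[A[i][(j + i) % n] for j in range(n)] for i in range(n)]
    b = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    c = [row[:] for row in C]  # C stays stationary in this variant

    for _ in range(n):
        # Multiply-add: every cell updates its local element of C.
        for i in range(n):
            for j in range(n):
                c[i][j] += a[i][j] * b[i][j]
        # Roll: A moves one cell left along rows, B one cell up along
        # columns, with wrap-around on the torus.
        a = [[a[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b = [[b[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return c


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[0, 0], [0, 0]]
print(torus_multiply_add(A, B, C))  # → [[19, 22], [43, 50]]
```

In this sketch each cell performs exactly n local multiply-adds, and all data movement is a unit roll between neighbouring cells, which is the property that makes the scheme suit a torus array. Since A^T×B^T = (B×A)^T, the transposed problem can in principle be reduced to an ordinary multiply-add followed by a transposition of the result; how the paper realizes that transposition on the torus is not shown here.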