Proceedings of the 27th international ACM conference on International conference on supercomputing
We provide efficient single-precision and integer GPU implementations of Strassen's algorithm as well as of Winograd's variant. On an NVIDIA C1060 GPU, a speedup of 32% (35%) is obtained for Strassen's 4-level implementation and 33% (36%) for Winograd's variant relative to the sgemm (integer version of sgemm) code in CUBLAS 3.0 when multiplying 16384×16384 matrices. The maximum numerical error for the single-precision implementations is about 2 orders of magnitude higher than that for sgemm when n = 16384 and is zero for the integer implementations.
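For readers unfamiliar with the recursion, the following is a minimal CPU sketch of Strassen's algorithm in plain Python. It is not the GPU implementation described above. The function names and the `levels` parameter are assumptions made for illustration; `levels` limits the recursion depth, in the spirit of the "4-level" configuration mentioned in the abstract. Each level replaces the 8 half-size block products of the classical method with 7. Because Python integers are exact, the result matches the classical product exactly, which is consistent with the zero error reported for the integer implementations.

```python
def mat_add(X, Y):
    # Element-wise sum of two equally sized matrices (lists of rows).
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


def mat_sub(X, Y):
    # Element-wise difference of two equally sized matrices.
    return [[a - b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


def mat_mul_classic(X, Y):
    # Classical O(n^3) product of two square matrices; used as the base case.
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def split(M):
    # Split a square matrix of even order into four quadrants.
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])


def strassen(A, B, levels):
    # Hypothetical helper: apply up to `levels` Strassen recursion steps,
    # then fall back to the classical product (also used for odd orders).
    n = len(A)
    if levels == 0 or n % 2 != 0:
        return mat_mul_classic(A, B)

    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    r = levels - 1

    # The seven half-size products of Strassen's algorithm.
    M1 = strassen(mat_add(A11, A22), mat_add(B11, B22), r)
    M2 = strassen(mat_add(A21, A22), B11, r)
    M3 = strassen(A11, mat_sub(B12, B22), r)
    M4 = strassen(A22, mat_sub(B21, B11), r)
    M5 = strassen(mat_add(A11, A12), B22, r)
    M6 = strassen(mat_sub(A21, A11), mat_add(B11, B12), r)
    M7 = strassen(mat_sub(A12, A22), mat_add(B21, B22), r)

    # Recombine the products into the four quadrants of C = A * B.
    C11 = mat_add(mat_sub(mat_add(M1, M4), M5), M7)
    C12 = mat_add(M3, M5)
    C21 = mat_add(M2, M4)
    C22 = mat_add(mat_add(mat_sub(M1, M2), M3), M6)

    # Stitch the quadrants back together row by row.
    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom
```

Calling `strassen(A, B, 2)` on two 8×8 integer matrices performs two levels of recursion and then multiplies the resulting 2×2 blocks classically. With floating-point entries the additions and subtractions of blocks introduce extra rounding, which is why the single-precision error is expected to be larger than that of a classical product.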