LAPACK: a portable linear algebra library for high-performance computers. In Proceedings of the 1990 ACM/IEEE Conference on Supercomputing.
Using MPI: portable parallel programming with the message-passing interface. MIT Press.
Global arrays: a nonuniform memory access programming model for high-performance computers. The Journal of Supercomputing.
Fast runtime block cyclic data redistribution on multiprocessors. Journal of Parallel and Distributed Computing.
Co-array Fortran for parallel programming. ACM SIGPLAN Fortran Forum.
Basic Linear Algebra Subprograms for Fortran Usage. ACM Transactions on Mathematical Software (TOMS).
In Proceedings of the 1996 ACM/IEEE Conference on Supercomputing (Supercomputing '96).
IEEE Parallel & Distributed Technology: Systems & Technology.
Python for Scientific Computing. Computing in Science and Engineering.
IPython: A System for Interactive Scientific Computing. Computing in Science and Engineering.
GPAW optimized for Blue Gene/P using hybrid programming. In IPDPS '09: Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing.
In CEFP '11: Proceedings of the 4th Summer School Conference on Central European Functional Programming School.
In this paper, we introduce DistNumPy, a library for numerical computation in Python that targets scalable distributed memory architectures. DistNumPy extends the NumPy module [15], which is popular for scientific programming. Replacing NumPy with DistNumPy enables the user to write sequential Python programs that seamlessly utilize distributed memory architectures. This is achieved by introducing a new backend for NumPy arrays that distributes data amongst the nodes of a distributed memory multiprocessor. All operations on such an array seek to utilize all available processors, and the array itself is distributed across multiple nodes in order to support arrays larger than a single node can hold in memory. We perform three experiments with sequential Python programs running on an Ethernet-based cluster of SMP nodes with a total of 64 CPU cores. The results show 88% CPU utilization for a Monte Carlo simulation, 63% for an N-body simulation, and a more modest 50% for a Jacobi solver. The primary limitation on CPU utilization is identified as limitations within the SMP nodes rather than the distribution aspect. Based on these experiments, we find that significant speedup can be obtained with the new array backend without changing the original Python code.
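To illustrate the programming model, the following is a minimal sketch of the kind of sequential NumPy program the abstract describes, here a Monte Carlo estimate of pi in the spirit of the Monte Carlo experiment reported above. The code uses only the standard NumPy API; according to the abstract, substituting DistNumPy for NumPy lets such a program run unchanged on a distributed memory machine. How DistNumPy is imported in place of NumPy is an assumption here, as the mechanism is not stated in this excerpt.

    # Sequential NumPy code; under DistNumPy the same program would run
    # distributed. (Assumption: DistNumPy acts as a drop-in replacement
    # for the "numpy" import, as the abstract suggests.)
    import numpy as np

    def monte_carlo_pi(n):
        # Draw n random points in the unit square. Under DistNumPy these
        # arrays would be partitioned across the cluster nodes.
        x = np.random.random(n)
        y = np.random.random(n)
        # Elementwise operations involve no explicit Python loops, so the
        # array backend is free to run them on all available processors.
        hits = (x * x + y * y) < 1.0
        # Under DistNumPy the sum would become a reduction across nodes.
        return 4.0 * hits.sum() / n

    print(monte_carlo_pi(10 ** 7))

Note that the program contains no parallel constructs at all; any speedup comes entirely from the array backend, which is what allows the original Python code to remain unchanged.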