BDMPI: conquering BigData with small clusters using MPI

Authors:
Dominique LaSalle;George Karypis
Affiliations:
University of Minnesota, Minneapolis, Minnesota;University of Minnesota, Minneapolis, Minnesota
Venue:
DISCS-2013 Proceedings of the 2013 International Workshop on Data-Intensive Scalable Computing Systems
Year:
2013

Abstract

Processing massive amounts of data on clusters with a finite amount of memory has become an important problem facing the parallel and distributed computing community. While MapReduce-style technologies provide an effective means of addressing problems that fit within the MapReduce paradigm, there are many classes of problems for which this paradigm is ill-suited. In this paper we present a runtime system for traditional MPI programs that enables the efficient and transparent disk-based execution of distributed-memory parallel programs. This system, called BDMPI, leverages the semantics of MPI's API to orchestrate the execution of a large number of MPI processes on far fewer compute nodes, so that the running processes maximize the amount of computation they perform with the data fetched from disk. BDMPI enables the development of efficient parallel distributed-memory disk-based codes without the high engineering and algorithmic complexities associated with multiple levels of blocking. BDMPI achieves significantly better performance than existing technologies on a single node (GraphChi) as well as on a small cluster (Hadoop).
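
The idea of running many logical processes on a node that can hold only a few of them in memory can be illustrated with a toy model. The sketch below is not BDMPI and does not use MPI; it only assumes that each logical rank keeps its partition on disk, is loaded when it is scheduled, performs all of its local computation, and then releases memory before the next rank runs. The function name `run_out_of_core_sum`, the cyclic partitioning, and the use of `pickle` files are hypothetical choices made for illustration.

```python
import os
import pickle
import tempfile


def run_out_of_core_sum(values, num_ranks, workdir):
    """Toy model: num_ranks logical ranks share one node, one resident at a time."""
    # Partition the input among the logical ranks and spill each partition to disk.
    paths = []
    for rank in range(num_ranks):
        chunk = values[rank::num_ranks]  # cyclic distribution, chosen for brevity
        path = os.path.join(workdir, f"rank{rank}.pkl")
        with open(path, "wb") as f:
            pickle.dump(chunk, f)
        paths.append(path)

    # Run the ranks one after another: load, do all local work, then release.
    partials = []
    for path in paths:
        with open(path, "rb") as f:
            chunk = pickle.load(f)  # this rank's data becomes memory-resident
        partials.append(sum(x * x for x in chunk))  # all local computation at once
        del chunk  # drop the data before the next rank is loaded

    # Stand-in for a reduction across ranks (an assumption, not an MPI call).
    return sum(partials)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        total = run_out_of_core_sum(list(range(10)), num_ranks=4, workdir=tmp)
        print(total)
```

In this model each partition is read from disk exactly once and all work on it is done while it is resident, which is the property the abstract describes as maximizing the computation performed with the data fetched from disk. A real runtime would additionally have to handle communication between ranks that are not resident at the same time, which this sketch omits.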