Scalable clustering algorithm for N-body simulations in a shared-nothing cluster

Authors:
YongChul Kwon;Dylan Nunley;Jeffrey P. Gardner;Magdalena Balazinska;Bill Howe;Sarah Loebman
Affiliations:
University of Washington, Seattle, WA;University of Washington, Seattle, WA;University of Washington, Seattle, WA;University of Washington, Seattle, WA;University of Washington, Seattle, WA;University of Washington, Seattle, WA
Venue:
SSDBM'10 Proceedings of the 22nd international conference on Scientific and statistical database management
Year:
2010

Citing 20
Cited 11

A Partitioning Strategy for Nonuniform Problems on Multiprocessors

IEEE Transactions on Computers
Parallel database systems: the future of high performance database systems

Communications of the ACM
Multidimensional access methods

ACM Computing Surveys (CSUR)
Approaches for scaling DBSCAN algorithm to large spatial databases

Journal of Computer Science and Technology
A Fast Parallel Clustering Algorithm for Large Spatial Databases

Data Mining and Knowledge Discovery
OpenMP: An Industry-Standard API for Shared-Memory Programming

IEEE Computational Science & Engineering
Experiments in Parallel Clustering with DBSCAN

Euro-Par '01 Proceedings of the 7th International Euro-Par Conference Manchester on Parallel Processing
Scalable density-based distributed clustering

PKDD '04 Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases
Interpreting the data: Parallel analysis with Sawzall

Scientific Programming - Dynamic Grids and Worldwide Computing
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Dryad: distributed data-parallel programs from sequential building blocks

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Enabling rapid development of parallel tree search applications

Proceedings of the 5th IEEE workshop on Challenges of large applications in distributed environments
Pig latin: a not-so-foreign language for data processing

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Clustera: an integrated computation and data management system

Proceedings of the VLDB Endowment
SCOPE: easy and efficient parallel processing of massive data sets

Proceedings of the VLDB Endowment
DisCo: Distributed Co-clustering with Map-Reduce: A Case Study towards Petabyte-Scale End-to-End Mining

ICDM '08 Proceedings of the 2008 Eighth IEEE International Conference on Data Mining
Distributed aggregation for data-parallel computing: interfaces and implementations

Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
PLANET: massively parallel learning of tree ensembles with MapReduce

Proceedings of the VLDB Endowment
MAD skills: new analysis practices for big data

Proceedings of the VLDB Endowment
DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language

OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation

Towards personal high-performance geospatial computing (HPC-G): perspectives and a case study

Proceedings of the ACM SIGSPATIAL International Workshop on High Performance and Distributed Geographic Information Systems
Hybrid merge/overlap execution technique for parallel array processing

Proceedings of the EDBT/ICDT 2011 Workshop on Array Databases
ArrayStore: a storage manager for complex parallel array processing

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
The case for being lazy: how to leverage lazy evaluation in MapReduce

Proceedings of the 2nd international workshop on Scientific cloud computing
Recipes for baking black forest databases: building and querying black hole merger trees from cosmological simulations

SSDBM'11 Proceedings of the 23rd international conference on Scientific and statistical database management
How to price shared optimizations in the cloud

Proceedings of the VLDB Endowment
SkewTune: mitigating skew in mapreduce applications

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
SkewTune in action: mitigating skew in MapReduce applications

Proceedings of the VLDB Endowment
Simulation of database-valued markov chains using SimSQL

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Performance comparison under failures of MPI and MapReduce: An analytical approach

Future Generation Computer Systems
MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data

Frontiers of Computer Science: Selected Publications from Chinese Universities

Quantified Score

Hi-index	0.00

Visualization

Abstract

Scientists' ability to generate and collect massive-scale datasets is increasing. As a result, constraints in data analysis capability rather than limitations in the availability of data have become the bottleneck to scientific discovery. MapReduce-style platforms hold the promise to address this growing data analysis problem, but it is not easy to express many scientific analyses in these new frameworks. In this paper, we study data analysis challenges found in the astronomy simulation domain. In particular, we present a scalable, parallel algorithm for data clustering in this domain. Our algorithm makes two contributions. First, it shows how a clustering problem can be efficiently implemented in a MapReduce-style framework. Second, it includes optimizations that enable scalability, even in the presence of skew. We implement our solution in the Dryad parallel data processing system using DryadLINQ. We evaluate its performance and scalability using a real dataset comprised of 906 million points, and show that in an 8-node cluster, our algorithm can process even a highly skewed dataset 17 times faster than the conventional implementation and offers near-linear scalability. Our approach matches the performance of an existing hand-optimized implementation used in astrophysics on a dataset with little skew and significantly outperforms it on a skewed dataset.