Scientists' ability to generate and collect massive-scale datasets is increasing. As a result, data analysis capability, rather than the availability of data, has become the bottleneck to scientific discovery. MapReduce-style platforms hold the promise of addressing this growing data analysis problem, but many scientific analyses are not easy to express in these new frameworks. In this paper, we study data analysis challenges found in the astronomy simulation domain. In particular, we present a scalable, parallel algorithm for data clustering in this domain. Our algorithm makes two contributions. First, it shows how a clustering problem can be efficiently implemented in a MapReduce-style framework. Second, it includes optimizations that enable scalability, even in the presence of skew. We implement our solution in the Dryad parallel data processing system using DryadLINQ. We evaluate its performance and scalability using a real dataset comprising 906 million points, and show that on an 8-node cluster, our algorithm can process even a highly skewed dataset 17 times faster than the conventional implementation and offers near-linear scalability. Our approach matches the performance of an existing hand-optimized implementation used in astrophysics on a dataset with little skew and significantly outperforms it on a skewed dataset.
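The abstract does not spell out the algorithm, so the following is only a minimal sketch of one common way to fit distance-based clustering into a map/reduce shape. It is not the paper's implementation. Everything in it is an assumption made for illustration: the 2-D points, the strip partitioning along x, the linking distance `eps`, and all function names. It replicates points near strip boundaries, clusters each strip independently (the "map" side), and then merges local clusters that share a replicated point (the "reduce" side). It does not attempt the skew optimizations the abstract mentions.

```python
import math
from collections import defaultdict
from itertools import combinations


class UnionFind:
    """Disjoint-set structure over arbitrary hashable keys."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def partition(points, width, eps):
    """Assign each point id to its x-strip, and also to a neighbouring
    strip when the point lies within eps of the shared boundary."""
    strips = defaultdict(list)
    for pid, (x, _y) in enumerate(points):
        home = math.floor(x / width)
        strips[home].append(pid)
        if x - home * width <= eps:
            strips[home - 1].append(pid)
        if (home + 1) * width - x <= eps:
            strips[home + 1].append(pid)
    return strips


def cluster_locally(strip_id, pids, points, eps):
    """Map side: link every pair of points within eps inside one strip.
    Brute force (quadratic) to keep the sketch short."""
    uf = UnionFind()
    for a, b in combinations(pids, 2):
        if math.dist(points[a], points[b]) <= eps:
            uf.union(a, b)
    # Local labels are namespaced by strip so they never collide.
    return {pid: (strip_id, uf.find(pid)) for pid in pids}


def merge(local_results):
    """Reduce side: a point seen in two strips ties its two local
    clusters together into one global cluster."""
    uf = UnionFind()
    first_label = {}
    for labels in local_results:
        for pid, label in labels.items():
            if pid in first_label:
                uf.union(first_label[pid], label)
            else:
                first_label[pid] = label
    return {pid: uf.find(label) for pid, label in first_label.items()}


def cluster(points, width, eps):
    # With width >= eps, any linked pair lies in the same or adjacent strips.
    assert width >= eps, "strip width must be at least the linking distance"
    strips = partition(points, width, eps)
    local = [cluster_locally(s, pids, points, eps) for s, pids in strips.items()]
    return merge(local)


if __name__ == "__main__":
    pts = [(0.9, 0.0), (1.1, 0.0), (5.0, 0.0)]
    labels = cluster(pts, width=1.0, eps=0.3)
    print(labels[0] == labels[1], labels[0] == labels[2])
```

In this sketch the first two points fall in different strips but end up with the same global label because each is replicated across the shared boundary, while the third point stays in its own cluster. With fixed-width strips, a dense region would overload a single worker, which is the kind of skew the abstract says the paper's optimizations address.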