DBCURE-MR: An efficient density-based clustering algorithm for large data using MapReduce

Authors:
Younghoon Kim;Kyuseok Shim;Min-Soeng Kim;June Sup Lee
Venue:
Information Systems
Year:
2014


Abstract

Clustering is a useful data mining technique that groups data points so that points within a group have similar characteristics while points in different groups are dissimilar. Density-based clustering algorithms such as DBSCAN and OPTICS are one widely used family of clustering algorithms. As more applications must deal with vast amounts of data, clustering such big data has become a challenging problem. Recently, parallelizing clustering algorithms on large clusters of commodity machines using the MapReduce framework has received a lot of attention. In this paper, we first propose a new density-based clustering algorithm, called DBCURE, which is robust in finding clusters with varying densities and is well suited to parallelization with MapReduce. We next develop DBCURE-MR, a parallelized version of DBCURE using MapReduce. While traditional density-based algorithms find clusters one at a time, DBCURE-MR finds several clusters together in parallel. We prove that both DBCURE and DBCURE-MR find the clusters correctly based on the definition of density-based clusters. Our experimental results with various data sets confirm that DBCURE-MR finds clusters efficiently, is not sensitive to clusters with varying densities, and scales up well with the MapReduce framework.
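
To make the idea of density-based clusters concrete, the following is a minimal single-machine sketch in the style of classic DBSCAN. It is not the DBCURE or DBCURE-MR algorithm from the paper. The function name `density_clusters` and the parameters `eps` (neighborhood radius) and `min_pts` (minimum neighborhood size for a core point) are assumptions chosen for illustration. The sketch merges neighboring core points with a union-find structure, so the merges can be applied in any order. This only hints at why such clusters need not be grown one at a time.

```python
from math import dist


def density_clusters(points, eps, min_pts):
    """DBSCAN-style clustering sketch; returns one label per point (-1 = noise)."""
    n = len(points)
    # Brute-force eps-neighborhoods (each point is its own neighbor), O(n^2).
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # A core point has at least min_pts points in its eps-neighborhood.
    core = [len(nb) >= min_pts for nb in neighbors]

    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge every pair of core points within eps of each other.
    # The result does not depend on the order in which pairs are merged.
    for i in range(n):
        if core[i]:
            for j in neighbors[i]:
                if core[j]:
                    parent[find(i)] = find(j)

    # Number the clusters in order of first appearance.
    labels = [-1] * n
    cluster_ids = {}
    for i in range(n):
        if core[i]:
            labels[i] = cluster_ids.setdefault(find(i), len(cluster_ids))

    # A non-core point joins the cluster of its first core neighbor, if any;
    # otherwise it stays labeled as noise.
    for i in range(n):
        if not core[i]:
            for j in neighbors[i]:
                if core[j]:
                    labels[i] = labels[j]
                    break
    return labels
```

For example, `density_clusters([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)], eps=1.5, min_pts=3)` groups the first three points into one cluster and the next three into another, and marks the isolated point as noise.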