DECODE: a new method for discovering clusters of different densities in spatial data

Authors:
Tao Pei;Ajay Jasra;David J. Hand;A-Xing Zhu;Chenghu Zhou
Affiliations:
Institute of Geographical Sciences and Natural Resources Research, Beijing, China 100101 and Institute for Mathematical Sciences, Imperial College, London, UK SW7 2PG;Department of Mathematics, Imperial College, London, UK;Department of Mathematics and Institute for Mathematical Sciences, Imperial College, London, UK;Institute of Geographical Sciences and Natural Resources Research, Beijing, China 100101 and Department of Geography, University of Wisconsin Madison, Madison, USA 53706-1491;Institute of Geographical Sciences and Natural Resources Research, Beijing, China 100101
Venue:
Data Mining and Knowledge Discovery
Year:
2009

Citing 12

Algorithms for clustering data
Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
OPTICS: ordering points to identify the clustering structure. SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
LOF: identifying density-based local outliers. SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications. Data Mining and Knowledge Discovery
WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases. VLDB '98 Proceedings of the 24th International Conference on Very Large Data Bases
Monte Carlo Statistical Methods (Springer Texts in Statistics)
A New Density-Based Scheme for Clustering Based on Genetic Algorithm. Fundamenta Informaticae
Detection of spatial and spatio-temporal clusters
KNN-kernel density-based clustering for high-dimensional multivariate data. Computational Statistics & Data Analysis
Non parametric local density-based clustering for multimodal overlapping distributions. IDEAL'06 Proceedings of the 7th international conference on Intelligent Data Engineering and Automated Learning
An approach to find embedded clusters using density based techniques. ICDCIT'05 Proceedings of the Second international conference on Distributed Computing and Internet Technology

Cited 3

Using structure-based data transformation method to improve prediction accuracies for small data sets. Decision Support Systems
Multi-scale decomposition of point process data. Geoinformatica
A Simpler and More Accurate AUTO-HDS Framework for Clustering and Visualization of Biological Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)

Abstract

When clusters of different densities and noise coexist in a spatial point set, the major obstacle to classifying the data is determining the classification thresholds, which form a series of bins used to allocate each point to a cluster. Much previous work has adopted a model-based approach, but it is either incapable of estimating the thresholds automatically or limited to only two point processes, i.e. noise and clusters of the same density. In this paper, we present a new density-based clustering method (DECODE), in which a spatial data set is presumed to consist of different point processes, and clusters with different densities belong to different point processes. DECODE is based upon a reversible jump Markov chain Monte Carlo (MCMC) strategy and is divided into three steps. The first step maps each point in the data to its mth nearest distance, defined as the distance between the point and its mth nearest neighbor. In the second step, classification thresholds are determined via a reversible jump MCMC strategy. In the third step, clusters are formed by spatially connecting the points whose mth nearest distances fall into a particular bin defined by the thresholds. Four experiments, on two simulated data sets and two seismic data sets, are used to evaluate the algorithm. Results on simulated data show that our approach is capable of discovering the clusters automatically. Results on seismic data suggest that the clustered earthquakes identified by DECODE either imply the epicenters of forthcoming strong earthquakes or indicate the areas with the most intensive seismicity, which is consistent with the tectonic states and estimated stress distribution in the associated areas. The comparison between DECODE and other state-of-the-art methods, such as DBSCAN, OPTICS and WaveCluster, illustrates the contribution of our approach: although DECODE can be computationally expensive, it is capable of identifying the number of point processes and simultaneously estimating the classification thresholds with little prior knowledge.
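The first and third steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names (`mth_nearest_distances`, `assign_bins`, `connect_bin`) are hypothetical, the thresholds are supplied by hand as a stand-in for the reversible jump MCMC estimate of the second step (which is not implemented here), and linking two points when they lie within a fixed `link_radius` is an assumed connection rule that the abstract does not specify.

```python
import bisect
import math


def mth_nearest_distances(points, m):
    """Step 1: map each point to the distance to its m-th nearest neighbor.

    Brute-force O(n^2 log n) version, for illustration only.
    """
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[m - 1])
    return result


def assign_bins(distances, thresholds):
    """Allocate each m-th nearest distance to a bin.

    Bin 0 holds the smallest distances (the densest points).
    ASSUMPTION: `thresholds` (sorted ascending) is given by hand here;
    DECODE estimates the thresholds with reversible jump MCMC.
    """
    return [bisect.bisect_left(thresholds, d) for d in distances]


def connect_bin(points, bins, target_bin, link_radius):
    """Step 3: spatially connect the points that fall into one bin.

    ASSUMPTION: two points are linked when they lie within `link_radius`
    of each other; clusters are the resulting connected components.
    """
    members = [i for i, b in enumerate(bins) if b == target_bin]
    unvisited = set(members)
    clusters = []
    for seed in members:
        if seed not in unvisited:
            continue
        unvisited.remove(seed)
        component, stack = [seed], [seed]
        while stack:
            cur = stack.pop()
            near = [j for j in unvisited
                    if math.dist(points[cur], points[j]) <= link_radius]
            for j in near:
                unvisited.remove(j)
                component.append(j)
                stack.append(j)
        clusters.append(sorted(component))
    return clusters


if __name__ == "__main__":
    # Toy data: a dense square, a sparser square, and one isolated point.
    pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
           (10.0, 10.0), (10.0, 13.0), (13.0, 10.0), (13.0, 13.0),
           (30.0, 30.0)]
    d = mth_nearest_distances(pts, m=2)
    bins = assign_bins(d, thresholds=[2.0, 5.0])  # hand-picked thresholds
    print(bins)
    print(connect_bin(pts, bins, target_bin=0, link_radius=2.0))
    print(connect_bin(pts, bins, target_bin=1, link_radius=5.0))
```

On this toy data the dense square, the sparse square, and the isolated point land in three different bins, so each density level can be connected into clusters separately. In DECODE itself the number of bins and the threshold values would come from the MCMC step rather than from hand-picked values.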