GoSCAN: Decentralized scalable data clustering

Authors:
Hoda Mashayekhi;Jafar Habibi;Spyros Voulgaris;Maarten van Steen
Affiliations:
Computer Engineering Department, Sharif University of Technology, Tehran, Iran;Computer Engineering Department, Sharif University of Technology, Tehran, Iran;Department of Computer Science, VU University, Amsterdam, The Netherlands;Department of Computer Science, VU University, Amsterdam, The Netherlands
Venue:
Computing
Year:
2013

Abstract

Identifying clusters is an important aspect of analyzing large datasets. Clustering algorithms classically require access to the complete dataset. However, as huge amounts of data increasingly originate from multiple, dispersed sources in distributed systems, alternative solutions are required. Furthermore, data and network dynamicity in a distributed setting demand adaptable clustering solutions that offer accurate clustering models at a reasonable pace. In this paper, we propose GoSCAN, a fully decentralized density-based clustering algorithm that is capable of clustering dynamic and distributed datasets without requiring central control or message flooding. We identify two major tasks: finding the core data points, and forming the actual clusters, which we execute in parallel employing gossip-based communication. This approach is very efficient, as it offers each peer enough authority to discover the clusters it is interested in. Our algorithm poses no extra burden of overlay formation in the network, while providing high levels of scalability. We also offer several optimizations to the basic clustering algorithm for reducing communication overhead and processing costs. Coping with dynamic data is made possible by introducing an age factor, which gradually detects dataset changes and enables clustering updates. In our experimental evaluation, we show that GoSCAN can discover the clusters efficiently with scalable transmission cost.
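
The first task named in the abstract, finding core data points through gossip, can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the algorithm from the paper: the `Peer` class, the `gossip_round` function, and the DBSCAN-style parameters `eps` (neighborhood radius) and `min_pts` (density threshold) are hypothetical names introduced here, and the abstract does not specify the actual message contents or peer selection. Cluster formation, the age factor, and the communication optimizations are omitted.

```python
import math
import random


class Peer:
    """Hypothetical peer that holds local points and tracks their eps-neighbors."""

    def __init__(self, peer_id, points, eps):
        self.peer_id = peer_id
        self.points = list(points)
        self.eps = eps
        # For each local point: the distinct points seen so far within eps.
        self.neighbors = {p: set() for p in self.points}
        # A peer always knows its own data, so account for it up front.
        self.absorb(self.points)

    def absorb(self, remote_points):
        """Record every received point that lies within eps of a local point."""
        for p in self.points:
            for q in remote_points:
                if math.dist(p, q) <= self.eps:
                    self.neighbors[p].add(q)

    def core_points(self, min_pts):
        """Local points whose known eps-neighborhood (itself included) is dense enough."""
        return {p for p, nbrs in self.neighbors.items() if len(nbrs) >= min_pts}


def gossip_round(peers, rng):
    """Each peer contacts one random partner and both exchange their local points.

    Assumption: partners are drawn uniformly at random from all peers; a real
    deployment would rely on a peer sampling service instead of a global list.
    """
    for peer in peers:
        partner = rng.choice([other for other in peers if other is not peer])
        peer.absorb(partner.points)
        partner.absorb(peer.points)


if __name__ == "__main__":
    rng = random.Random(0)
    eps, min_pts = 1.5, 4
    peers = [
        Peer(0, [(0, 0), (0, 1)], eps),
        Peer(1, [(1, 0), (5, 5)], eps),
        Peer(2, [(1, 1), (9, 9)], eps),
    ]
    for _ in range(30):
        gossip_round(peers, rng)
    for peer in peers:
        print(peer.peer_id, sorted(peer.core_points(min_pts)))
```

In this sketch no peer ever sees the complete dataset as a single structure; each one only refines the neighbor counts of its own points as exchanges accumulate, which is the kind of local authority the abstract attributes to peers. Exchanging full local datasets, as done here for brevity, would be costly at scale, which is presumably where the optimizations mentioned in the abstract come in.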