ClusterMap: labeling clusters in large datasets via visualization

Authors:
Keke Chen;Ling Liu
Affiliations:
Georgia Institute of Technology, Atlanta, GA;Georgia Institute of Technology, Atlanta, GA
Venue:
Proceedings of the thirteenth ACM international conference on Information and knowledge management
Year:
2004

Abstract

With the rapid growth of data in many areas, clustering large datasets has become an important problem in data analysis. Because cluster analysis is a highly iterative process, on large datasets it is best carried out as short iterations over a relatively small representative set. A two-phase framework, "sampling/summarization - iterative cluster analysis", is therefore often applied in practice. Since the clustering result labels only the small representative set, extending the result to the entire large dataset raises problems that traditional clustering research has largely ignored. This extension step is often called the labeling process. Its main problems are labeling irregularly shaped clusters, distinguishing outliers, and extending cluster boundaries. We address these problems and propose a visualization-based approach for handling them precisely. The approach partially involves a human in the process of defining and refining a structure called "ClusterMap". Based on this structure, the ClusterMap algorithm scans the large dataset, adapting the boundary extension and generating the cluster labels for the entire dataset. Experimental results show that, compared with distance-comparison-based labeling algorithms, ClusterMap can preserve cluster quality to a considerable degree at low computational cost.
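
The labeling phase described in the abstract can be illustrated with a rough sketch. This record does not give the details of the ClusterMap structure or of its boundary extension, so everything below is an assumption made for illustration: points are taken to be already projected to 2D, the map is modeled as a plain grid of cells, and the names `build_map`, `label_dataset`, `cell_size`, and `OUTLIER` are hypothetical. It shows only the general idea of labeling by map lookup during a single scan, in contrast to comparing each point's distance against cluster representatives.

```python
from collections import Counter, defaultdict

OUTLIER = -1  # hypothetical marker for points that fall in no labeled cell


def cell_of(point, cell_size):
    """Map a 2D point to the integer grid cell that contains it."""
    x, y = point
    return (int(x // cell_size), int(y // cell_size))


def build_map(sample_points, sample_labels, cell_size):
    """Build a grid map from the labeled representative set (assumed 2D).

    Each occupied cell takes the majority cluster label of the
    representative points that fall inside it.
    """
    votes = defaultdict(Counter)
    for point, label in zip(sample_points, sample_labels):
        votes[cell_of(point, cell_size)][label] += 1
    return {cell: counts.most_common(1)[0][0] for cell, counts in votes.items()}


def label_dataset(points, cluster_map, cell_size):
    """Scan the full dataset once and label each point by cell lookup.

    Boundary extension is NOT modeled here: a point in an unlabeled
    cell is simply marked as an outlier.
    """
    return [cluster_map.get(cell_of(p, cell_size), OUTLIER) for p in points]


# Phase 1 output (assumed): a small representative set with cluster labels.
sample_points = [(0.2, 0.3), (0.7, 0.4), (5.1, 5.2), (5.6, 5.9)]
sample_labels = [0, 0, 1, 1]
cluster_map = build_map(sample_points, sample_labels, cell_size=1.0)

# Phase 2: extend the labels to the (here tiny) full dataset in one pass.
dataset = [(0.5, 0.5), (5.5, 5.5), (9.0, 9.0)]
labels = label_dataset(dataset, cluster_map, cell_size=1.0)
```

In this sketch each lookup costs constant time per point, which is one plausible reading of why a map-based scan could be cheaper than distance comparison; the third point lands in an unlabeled cell and is marked as an outlier.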