ClusterMap: labeling clusters in large datasets via visualization

  • Authors:
  • Keke Chen;Ling Liu

  • Affiliations:
  • Georgia Institute of Technology, Atlanta, GA;Georgia Institute of Technology, Atlanta, GA

  • Venue:
  • Proceedings of the thirteenth ACM international conference on Information and knowledge management
  • Year:
  • 2004

Quantified Score

Hi-index 0.01

Visualization

Abstract

With the rapid increase of data in many areas, clustering on large datasets has become an important problem in data analysis. Since cluster analysis is a highly iterative process, cluster analysis on large datasets prefers short iteration on a relatively small representative set. Thus, a two-phase framework "sampling/summarization - iterative cluster analysis" is often applied in practice. Since the clustering result only labels the small representative set, there are problems with extending the result to the entire large dataset, which are almost ignored by the traditional clustering research. This extending is often named as labeling process. Labeling irregular shaped clusters, distinguishing outliers and extending cluster boundary are the main problems in this stage. We address these problems and propose a visualization-based approach to dealing with them precisely. This approach partially involves human into the process of defining and refining the structure "ClusterMap". Based on this structure, the ClusterMap algorithm scans the large dataset to adapt the boundary extension and generate the cluster labels for the entire dataset. Experimental result shows that ClusterMap can preserve cluster quality considerably with low computational cost, compared to the distance-comparison-based labeling algorithms.