iVIBRATE: Interactive visualization-based framework for clustering large datasets

Authors:
Keke Chen;Ling Liu
Affiliations:
Georgia Institute of Technology, Atlanta, GA;Georgia Institute of Technology, Atlanta, GA
Venue:
ACM Transactions on Information Systems (TOIS)
Year:
2006


Visualization

Abstract

With continued advances in communication network technology and sensing technology, there has been astounding growth in the amount of data produced and made available through cyberspace. Efficient and high-quality clustering of large datasets continues to be one of the most important problems in large-scale data analysis. A commonly used methodology for cluster analysis on large datasets is the three-phase framework of sampling/summarization, iterative cluster analysis, and disk-labeling. There are three known problems with this framework which demand effective solutions. The first problem is how to effectively define and validate irregularly shaped clusters, especially in large datasets. Automated algorithms and statistical methods are typically not effective in handling these particular clusters. The second problem is how to effectively label the entire dataset on disk (disk-labeling) without introducing additional errors, including solutions for dealing with outliers, irregular clusters, and cluster boundary extension. The third problem is the lack of research on how to effectively integrate the three phases. In this article, we describe iVIBRATE, an interactive visualization-based three-phase framework for clustering large datasets. The two main components of iVIBRATE are its VISTA visual cluster-rendering subsystem, which brings human interaction into the large-scale iterative clustering process through interactive visualization, and its adaptive ClusterMap labeling subsystem, which offers visualization-guided disk-labeling solutions that are effective in dealing with outliers, irregular clusters, and cluster boundary extension.
Another important contribution of the iVIBRATE development is the identification of the special issues that arise in integrating the two components and the sampling approach into a coherent framework, as well as solutions for improving the reliability of the framework and for minimizing the number of errors generated within the cluster analysis process. We study the effectiveness of the iVIBRATE framework through a walkthrough example on a dataset of a million records, and we experimentally evaluate the iVIBRATE approach using both real-life and synthetic datasets. Our results show that iVIBRATE can efficiently involve the user in the clustering process and generate high-quality clustering results for large datasets.
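To make the generic three-phase framework named in the abstract concrete, the following is a minimal Python sketch of sampling, cluster analysis on the sample, and labeling of the full dataset. It is not iVIBRATE itself: reservoir sampling, plain k-means, and nearest-centroid labeling are stand-ins chosen for illustration, whereas the article uses the interactive VISTA and ClusterMap subsystems for the second and third phases. All function names and the synthetic two-cluster data are assumptions.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+


def reservoir_sample(stream, k, rng):
    """Phase 1 (assumed technique): uniform sample of k records in one pass."""
    sample = []
    for i, record in enumerate(stream):
        if i < k:
            sample.append(record)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = record
    return sample


def kmeans(points, k, rng, iterations=20):
    """Phase 2 stand-in: plain k-means on the in-memory sample."""
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centroids[c]))
            groups[nearest].append(p)
        # Recompute each centroid as the mean of its group; keep the old
        # centroid if a group happens to be empty.
        centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centroids[c]
            for c, g in enumerate(groups)
        ]
    return centroids


def label_stream(stream, centroids):
    """Phase 3 stand-in: one pass over all records, nearest-centroid label."""
    for record in stream:
        yield min(range(len(centroids)), key=lambda c: dist(record, centroids[c]))


# Hypothetical data: two well-separated Gaussian blobs standing in for
# a large dataset on disk.
rng = random.Random(0)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(5000)]
data += [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(5000)]

sample = reservoir_sample(iter(data), 500, rng)
centroids = kmeans(sample, 2, rng)
labels = list(label_stream(iter(data), centroids))
```

Nearest-centroid labeling implicitly assumes roughly spherical clusters, so a sketch like this handles irregularly shaped clusters, outliers, and cluster boundary extension poorly. Those are the cases the abstract identifies as problems for the conventional framework.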