Enhancing the Effectiveness of Clustering with Spectra Analysis

Authors:
Wenyuan Li;Wee-Keong Ng;Ying Liu;Kok-Leong Ong
Affiliations:
-;IEEE Computer Society;IEEE Computer Society;-
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2007

Abstract

Many clustering algorithms, such as K-Means, EM, and CLOPE, require the user to set some parameters. Often, these parameters directly or indirectly control the number of clusters, k, to return. Because data characteristics and analysis contexts vary, it is often difficult for the user to estimate the number of clusters in a data set. This is especially true for collections such as Web documents, images, or biological data. In an effort to improve the effectiveness of clustering, we seek the answer to a fundamental question: How can we effectively estimate the number of clusters in a given data set? As a solution, we propose an efficient method based on spectra analysis of the eigenvalues (not the eigenvectors) of the data set. We first present the relationship between a data set and its underlying spectra with theoretical and experimental results. We then show how our method can suggest a range of k that is well suited to different analysis contexts. Finally, we conclude with further empirical results that show how the answer to this fundamental question enhances the clustering process for large text collections.
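
The abstract does not spell out the procedure, so the following is only a minimal sketch of the general idea of reading a cluster count from eigenvalues alone. It assumes a common eigengap heuristic on a degree-normalized similarity matrix, which is not necessarily the authors' method. The function name `estimate_k`, the parameter `max_k`, and the toy similarity matrix are all hypothetical and chosen for illustration.

```python
import numpy as np


def estimate_k(similarity, max_k=10):
    """Suggest a cluster count from the largest eigengap.

    Assumption: this is a generic eigengap heuristic, not the
    procedure from the paper. `similarity` must be a symmetric,
    non-negative matrix whose rows all have a positive sum.
    """
    S = np.asarray(similarity, dtype=float)
    degrees = S.sum(axis=1)
    # Symmetric normalization D^{-1/2} S D^{-1/2}; its eigenvalues lie in [-1, 1].
    d_inv_sqrt = 1.0 / np.sqrt(degrees)
    normalized = S * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending
    # order; reverse them so the largest comes first. No eigenvectors needed.
    eigenvalues = np.linalg.eigvalsh(normalized)[::-1]
    # Look for the biggest drop among the leading eigenvalues.
    top = eigenvalues[: max_k + 1]
    gaps = top[:-1] - top[1:]
    k = int(np.argmax(gaps)) + 1
    return k, eigenvalues


# Hypothetical toy data: three groups (sizes 4, 3, 5) with strong
# within-group similarity (1.0) and weak cross-group similarity (0.05).
sizes = [4, 3, 5]
labels = np.repeat(np.arange(len(sizes)), sizes)
S = np.where(labels[:, None] == labels[None, :], 1.0, 0.05)

k, eigenvalues = estimate_k(S)
print("suggested k:", k)
print("leading eigenvalues:", np.round(eigenvalues[:5], 3))
```

On this toy matrix the three leading eigenvalues sit close to 1 and the rest are near 0, so the largest gap follows the third eigenvalue. Ranking the gaps instead of taking only the largest would give several candidate values, which is one way to read the abstract's "range of k", though that reading is also an assumption.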