Document clustering using synthetic cluster prototypes

Authors:
Argyris Kalogeratos;Aristidis Likas
Venue:
Data & Knowledge Engineering
Year:
2011

Abstract

Centroids are not always the best prototypes for clustering text documents with the k-means family of methods, because text data is high-dimensional, sparse, and of low quality. In particular, when we seek clusters with a small number of objects, centroids may lead to poor solutions that stay close to a bad initialization. To overcome this problem, we propose the idea of a synthetic cluster prototype, which is computed in three steps: first selecting a subset of the cluster objects (instances), then computing the representative of these objects, and finally selecting important features. In this spirit, we introduce the MedoidKNN synthetic prototype, which favors the representation of the dominant class in a cluster. These synthetic cluster prototypes are incorporated into the generic spherical k-means procedure, leading to a robust clustering method called k-synthetic prototypes (k-sp). Comparative experimental evaluation demonstrates the robustness of the approach, especially for small datasets and for clusters that overlap in many dimensions, and its superior performance against traditional and subspace clustering methods.
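The three steps of the synthetic prototype (select a subset of cluster objects, compute their representative, keep the important features) can be illustrated with a minimal sketch. The abstract does not give the exact definitions, so everything specific here is an assumption made for illustration: the medoid is taken as the object with the highest total cosine similarity to the other cluster members, the subset is the medoid's `k` most similar members, the representative is their mean, and "important" features are simply the `n_features` largest weights. The function name `medoid_knn_prototype` and its parameters are hypothetical, not the authors' implementation.

```python
import numpy as np

def medoid_knn_prototype(X, k=3, n_features=2):
    """Sketch of a MedoidKNN-style prototype for ONE cluster.

    X: (n, d) array of unit-length document vectors (rows).
    k, n_features: illustrative parameters, not taken from the paper.
    """
    sims = X @ X.T  # cosine similarities, since rows have unit length
    # Assumed medoid definition: highest total similarity to the cluster.
    medoid = int(np.argmax(sims.sum(axis=1)))
    # Step 1: subset = the k members most similar to the medoid
    # (the medoid itself is included, having similarity 1 to itself).
    neighbors = np.argsort(-sims[medoid], kind="stable")[:k]
    # Step 2: representative = mean of the selected objects.
    rep = X[neighbors].mean(axis=0)
    # Step 3: keep only the n_features largest weights, zero the rest.
    keep = np.argsort(-rep, kind="stable")[:n_features]
    proto = np.zeros_like(rep)
    proto[keep] = rep[keep]
    # Renormalize so the prototype can be used in spherical k-means.
    norm = np.linalg.norm(proto)
    return proto / norm if norm > 0 else proto

# Toy cluster: three documents from a dominant class (features 0 and 1)
# and one outlier that uses only feature 3.
raw = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.8, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])
X = raw / np.linalg.norm(raw, axis=1, keepdims=True)
proto = medoid_knn_prototype(X, k=3, n_features=2)
centroid = X.mean(axis=0)
```

In this toy cluster the plain centroid carries weight on the outlier's feature, while the sketched prototype does not, because the outlier is not among the medoid's nearest neighbors. This is one way to read the claim that the prototype favors the dominant class in a cluster.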