Clustering and understanding documents via discrimination information maximization

Authors:
Malik Tahir Hassan; Asim Karim
Affiliations:
Dept. of Computer Science, LUMS School of Science and Engineering, Lahore, Pakistan (both authors)
Venue:
PAKDD'12: Proceedings of the 16th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, Volume Part I
Year:
2012

Abstract

Text document clustering is a popular task for understanding and summarizing large document collections. Besides being efficient, document clustering methods should produce clusters that are readily understandable as collections of documents relating to particular contexts or topics. Existing clustering methods often ignore term-document semantics and rely instead on geometric similarity measures. In this paper, we present an efficient iterative partitional clustering method, CDIM, that maximizes the sum of discrimination information provided by documents. The discrimination information of a document is computed from the discrimination information provided by its terms, and term discrimination information is estimated from the currently labeled document collection. A key advantage of CDIM is that its clusters are describable by their highly discriminating terms, that is, terms with high semantic relatedness to their clusters' contexts. We evaluate CDIM both qualitatively and quantitatively on ten text data sets. In the clustering quality evaluation, we find that CDIM produces high-quality clusters superior to those generated by the best competing methods. We also demonstrate the understandability provided by CDIM, suggesting its suitability for practical document clustering.
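
The iterative scheme outlined in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state which discrimination measure CDIM uses, so the code below assumes a smoothed relative risk, p(term | cluster) / p(term | other clusters), kept only when it exceeds 1. The function names (`term_weights`, `doc_score`, `cluster`), the `smooth` parameter, and the toy documents are all hypothetical and chosen for illustration.

```python
from collections import Counter


def term_weights(docs, labels, k, smooth=1.0):
    """Estimate per-cluster term discrimination scores from the current labels.

    Assumption (not from the abstract): the score is a smoothed relative risk,
    p(t | cluster) / p(t | other clusters), kept only when it exceeds 1.
    """
    n = len(docs)
    sizes = Counter(labels)
    df = [Counter() for _ in range(k)]  # per-cluster document frequency of each term
    total = Counter()                   # collection-wide document frequency
    for doc, c in zip(docs, labels):
        for t in set(doc):
            df[c][t] += 1
            total[t] += 1

    weights = [{} for _ in range(k)]
    for c in range(k):
        n_in, n_out = sizes[c], n - sizes[c]
        for t, in_count in df[c].items():
            p_in = (in_count + smooth) / (n_in + 2 * smooth)
            p_out = (total[t] - in_count + smooth) / (n_out + 2 * smooth)
            ratio = p_in / p_out
            if ratio > 1.0:  # keep only terms that discriminate in favor of cluster c
                weights[c][t] = ratio
    return weights


def doc_score(doc, cluster_weights):
    """Discrimination information of a document for one cluster: sum over its terms."""
    return sum(cluster_weights.get(t, 0.0) for t in set(doc))


def cluster(docs, init_labels, k, max_iter=50):
    """Alternate between re-estimating term scores and reassigning documents."""
    labels = list(init_labels)
    for _ in range(max_iter):
        weights = term_weights(docs, labels, k)
        new_labels = [
            max(range(k), key=lambda c: doc_score(doc, weights[c])) for doc in docs
        ]
        if new_labels == labels:  # no document changed cluster: converged
            break
        labels = new_labels
    return labels, term_weights(docs, labels, k)


if __name__ == "__main__":
    # Hypothetical toy collection; the last document starts in the wrong cluster.
    toy_docs = [
        ["goal", "match", "team"],
        ["team", "coach", "goal"],
        ["match", "goal", "league"],
        ["stock", "market", "bank"],
        ["bank", "loan", "market"],
        ["market", "stock", "trade"],
    ]
    final_labels, final_weights = cluster(toy_docs, [0, 0, 0, 1, 1, 0], k=2)
    print(final_labels)
    for c, w in enumerate(final_weights):
        print(c, sorted(w, key=w.get, reverse=True)[:3])
```

In this sketch each cluster is described by its highest-scoring terms, which mirrors the understandability claim in the abstract. Like other iterative partitional methods, a procedure of this kind can settle in a local optimum, so the result depends on the initial labels.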