On Knowledge-Enhanced Document Clustering

Authors:
Manjeet Rege;Josan Koruthu;Reynold Bailey
Affiliations:
Rochester Institute of Technology, Rochester, NY, USA;Rochester Institute of Technology, Rochester, NY, USA;Rochester Institute of Technology, Rochester, NY, USA
Venue:
International Journal of Information Retrieval Research
Year:
2012

Abstract

Document clustering plays an important role in text analytics by finding natural groupings of documents based on their similarity, as determined by the words appearing in them. Many of the clustering algorithms accessible through text analytics tools are completely unsupervised. That is, they cannot incorporate any domain knowledge that might be available about the documents to improve clustering accuracy and relevance. The authors present a semi-supervised document clustering algorithm based on graph partitioning. The user provides knowledge about a few of the documents in the form of "must-link" and "cannot-link" constraints between pairs of documents. A "must-link" constraint between two documents expresses the user's judgment that the two documents must be clustered together irrespective of their dissimilarity. Similarly, a "cannot-link" constraint signifies that the two documents should never be clustered together, no matter how similar they might happen to be. These constraints are then incorporated into a computationally efficient document clustering algorithm based on graph partitioning. The proposed framework is validated through experiments performed on publicly available text datasets.
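
To make the idea of pairwise constraints concrete, the following is a minimal sketch of one common way to encode "must-link" and "cannot-link" pairs as edge-weight overrides in a document similarity graph. It is an illustration only and not the authors' algorithm: the abstract does not specify how the constraints enter the partitioning, so the function names (`cosine`, `build_graph`, `components`), the weight values 1.0 and 0.0, the threshold of 0.3, and the sample documents are all assumptions. The grouping step is a deliberately simple stand-in (connected components over thresholded edges) rather than a real graph partitioning method.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_graph(docs, must_link=(), cannot_link=()):
    """Build a symmetric similarity matrix, then override constrained pairs.

    Assumption for illustration: must-link pairs get the maximum weight (1.0)
    and cannot-link pairs get weight 0.0.
    """
    bags = [Counter(d.lower().split()) for d in docs]
    n = len(docs)
    w = [[0.0 if i == j else cosine(bags[i], bags[j]) for j in range(n)]
         for i in range(n)]
    for i, j in must_link:
        w[i][j] = w[j][i] = 1.0
    for i, j in cannot_link:
        w[i][j] = w[j][i] = 0.0
    return w


def components(w, threshold=0.3):
    """Stand-in grouping step: connected components of edges >= threshold.

    Note: zeroing a cannot-link edge only removes the direct edge; two
    documents can still end up together through a path of other documents.
    """
    n = len(w)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            u = stack.pop()
            for v in range(n):
                if labels[v] == -1 and w[u][v] >= threshold:
                    labels[v] = current
                    stack.append(v)
        current += 1
    return labels


# Hypothetical toy corpus, not taken from the paper's datasets.
docs = [
    "stock market shares trading",
    "market shares investors trading",
    "football match goal",
    "goal football league match",
    "bank loan interest",
]

unconstrained = components(build_graph(docs))
constrained = components(
    build_graph(docs, must_link=[(0, 4)], cannot_link=[(2, 3)])
)
print(unconstrained)  # [0, 0, 1, 1, 2]
print(constrained)    # [0, 0, 1, 2, 0]
```

In this toy run, the must-link pair pulls the "bank loan interest" document into the finance group even though it shares no words with it, and the cannot-link pair splits the two football documents despite their high word overlap. A real partitioning method would need to handle the indirect-path caveat noted in the comments.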