Information retrieval as statistical translation
Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Document language models, query models, and risk minimization for information retrieval
Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
A study of smoothing methods for language models applied to Ad Hoc information retrieval
Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Two-stage language models for information retrieval
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Retrieving collocations from text: Xtract
Computational Linguistics - Special issue on using large corpora: I
Integrating word relationships into language models
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Generative model-based document clustering: a comparative study
Knowledge and Information Systems
Integrating Compound Terms in Bayesian Text Classification
WI '05 Proceedings of the 2005 IEEE/WIC/ACM International Conference on Web Intelligence
Context-sensitive semantic smoothing for the language modeling approach to genomic IR
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
A comparative evaluation of different link types on enhancing document clustering
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Concept Level Web Search Via Semantic Clustering
ICCS '07 Proceedings of the 7th international conference on Computational Science, Part III: ICCS 2007
Leveraging network structure for incremental document clustering
APWeb'12 Proceedings of the 14th Asia-Pacific international conference on Web Technologies and Applications
Discovering K web user groups with specific aspect interests
MLDM'12 Proceedings of the 8th international conference on Machine Learning and Data Mining in Pattern Recognition
Semantic smoothing for text clustering
Knowledge-Based Systems
Hi-index | 0.00 |
In this paper, we argue that the agglomerative clustering with vector cosine similarity measure performs poorly due to two reasons. First, the nearest neighbors of a document belong to different classes in many cases since any pair of documents shares lots of "general" words. Second, the sparsity of class-specific "core" words leads to grouping documents with the same class labels into different clusters. Both problems can be resolved by suitable smoothing of document model and using Kullback-Leibler divergence of two smoothed models as pairwise document distances. Inspired by the recent work in information retrieval, we propose a novel context-sensitive semantic smoothing method that can automatically identifies multiword phrases in a document and then statistically map phrases to individual document terms. We evaluate the new model-based similarity measure on three datasets using complete linkage criterion for agglomerative clustering and find out it significantly improves the clustering quality over the traditional vector cosine measure.