Semantic smoothing of document models for agglomerative clustering

Authors:
Xiaohua Zhou;Xiaodan Zhang;Xiaohua Hu
Affiliations:
Drexel University, College of Information Science & Technology;Drexel University, College of Information Science & Technology;Drexel University, College of Information Science & Technology
Venue:
IJCAI'07 Proceedings of the 20th international joint conference on Artificial intelligence
Year:
2007

Abstract

In this paper, we argue that agglomerative clustering with the vector cosine similarity measure performs poorly for two reasons. First, the nearest neighbors of a document often belong to different classes, because any pair of documents shares many "general" words. Second, the sparsity of class-specific "core" words causes documents with the same class label to be grouped into different clusters. Both problems can be resolved by suitably smoothing the document models and using the Kullback-Leibler divergence between two smoothed models as the pairwise document distance. Inspired by recent work in information retrieval, we propose a novel context-sensitive semantic smoothing method that automatically identifies multiword phrases in a document and then statistically maps those phrases to individual document terms. We evaluate the new model-based similarity measure on three datasets, using the complete linkage criterion for agglomerative clustering, and find that it significantly improves clustering quality over the traditional vector cosine measure.
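
To make the pipeline in the abstract concrete, here is a minimal Python sketch of model-based clustering: smooth each document's unigram model, use KL divergence between smoothed models as the pairwise distance, and merge clusters under the complete linkage criterion. Several parts are assumptions rather than the paper's method. The smoothing is plain interpolation with a corpus background model, a simple stand-in for the context-sensitive semantic smoothing that maps phrases to terms. The distance is the symmetrized KL divergence, which is one possible choice. The function names, the toy documents, and the mixing weight `lam` are all hypothetical.

```python
import math
from collections import Counter

def smoothed_model(tokens, background, lam=0.5):
    """Interpolate a document's maximum-likelihood unigram model with a
    background model. ASSUMPTION: this replaces the paper's semantic smoothing."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: (1 - lam) * counts[w] / total + lam * p_bg
            for w, p_bg in background.items()}

def kl_divergence(p, q):
    """KL(p || q) over a shared vocabulary where both models are positive."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

def symmetric_kl(p, q):
    """ASSUMPTION: symmetrized KL, so the distance matrix is symmetric."""
    return kl_divergence(p, q) + kl_divergence(q, p)

def complete_linkage(dist, k):
    """Naive agglomerative clustering: repeatedly merge the two clusters whose
    farthest pair of members is closest, until k clusters remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical toy corpus: two documents per topic.
docs = [
    "gene protein expression cell".split(),
    "protein cell gene binding".split(),
    "stock market price trading".split(),
    "market trading stock investor".split(),
]
all_tokens = [w for d in docs for w in d]
background = {w: c / len(all_tokens) for w, c in Counter(all_tokens).items()}
models = [smoothed_model(d, background) for d in docs]
dist = [[symmetric_kl(p, q) for q in models] for p in models]
clusters = complete_linkage(dist, k=2)
print(clusters)
```

On this toy input the two biology documents end up in one cluster and the two finance documents in the other. Because smoothing gives every vocabulary word a nonzero probability in every model, the KL divergence is always finite; without smoothing, any word absent from one document would make the divergence undefined.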