Semantic Smoothing for Model-based Document Clustering

Authors:
Xiaodan Zhang;Xiaohua Zhou;Xiaohua Hu
Affiliations:
Drexel University;Drexel University;Drexel University
Venue:
ICDM '06 Proceedings of the Sixth International Conference on Data Mining
Year:
2006

Citing 0
Cited 7

A comparative evaluation of different link types on enhancing document clustering
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval

Clustering Massive Text Data Streams by Semantic Smoothing Model
ADMA '07 Proceedings of the 3rd international conference on Advanced Data Mining and Applications

Document Clustering by Semantic Smoothing and Dynamic Growing Cell Structure (DynGCS) for Biomedical Literature
DaWaK '08 Proceedings of the 10th international conference on Data Warehousing and Knowledge Discovery

PhraseRank for document clustering: reweighting the weight of phrase
Proceedings of the 2nd International Conference on Interaction Sciences: Information Technology, Culture and Human

A comparative study of ontology based term similarity measures on PubMed document clustering
DASFAA'07 Proceedings of the 12th international conference on Database systems for advanced applications

A comparison of machine learning techniques for detection of drug target articles
Journal of Biomedical Informatics

Data stream clustering: A survey
ACM Computing Surveys (CSUR)

Abstract

A document is often full of class-independent "general" words and short of class-specific "core" words, which makes document clustering difficult. We argue that suitable smoothing of document models in agglomerative approaches, and of cluster models in partitional approaches, relieves both problems and thereby improves clustering quality. To the best of our knowledge, most model-based clustering approaches use Laplacian smoothing to prevent zero probabilities, while most similarity-based approaches employ the heuristic TF*IDF scheme to discount the effect of "general" words. Inspired by a series of statistical translation language models for text retrieval, we propose in this paper a novel smoothing method, referred to as context-sensitive semantic smoothing, for document clustering. Comparative experiments on three datasets show that model-based clustering approaches with semantic smoothing are effective in improving cluster quality.
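
The contrast between Laplacian smoothing and a translation-based semantic smoothing can be illustrated with a small sketch. The abstract does not give the formulas, so everything below is an assumption made for illustration: the function names, the toy vocabulary, the hand-written translation table, and the choice of a simple linear interpolation with weight `lam` between a Laplace-smoothed unigram model and a topic-to-word translation model. It is not the authors' implementation.

```python
from collections import Counter


def laplace_model(tokens, vocab, alpha=1.0):
    """Unigram model with additive (Laplacian) smoothing over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def semantic_model(tokens, topics, vocab, translation, lam=0.5, alpha=1.0):
    """Interpolate a Laplace-smoothed model with a translation model (illustrative).

    `topics` are the topic signatures found in the document (hypothetical input),
    and `translation[t][w]` is an assumed probability of word w given topic t.
    """
    base = laplace_model(tokens, vocab, alpha)
    topic_counts = Counter(topics)
    n_topics = sum(topic_counts.values())
    if n_topics == 0:
        # No topic signatures found: fall back to the Laplace-smoothed model.
        return base
    mixed = {}
    for w in vocab:
        # Probability of w obtained by "translating" the document's topics.
        translated = sum(
            translation.get(t, {}).get(w, 0.0) * c / n_topics
            for t, c in topic_counts.items()
        )
        mixed[w] = (1 - lam) * base[w] + lam * translated
    return mixed


# Toy data, invented for this example only.
vocab = ["star", "galaxy", "the", "of"]
tokens = ["the", "star", "of", "the"]
topics = ["star"]
translation = {"star": {"star": 0.5, "galaxy": 0.5}}

base = laplace_model(tokens, vocab)
mixed = semantic_model(tokens, topics, vocab, translation, lam=0.5)

for w in vocab:
    print(f"{w:8s} laplace={base[w]:.4f} semantic={mixed[w]:.4f}")
```

In this toy setting the interpolated model moves probability mass from the "general" words ("the", "of") toward the topic-related words ("star", "galaxy"), including "galaxy", which never occurs in the document. Laplacian smoothing alone only guarantees that unseen words get a non-zero probability.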