A probabilistic model for clustering text documents with multiple fields

Authors:
Shanfeng Zhu;Ichigaku Takigawa;Shuqin Zhang;Hiroshi Mamitsuka
Affiliations:
Bioinformatics Center, Institute for Chemical Research, Kyoto University, Japan;Bioinformatics Center, Institute for Chemical Research, Kyoto University, Japan;Department of Mathematics, The University of Hong Kong, Hong Kong;Bioinformatics Center, Institute for Chemical Research, Kyoto University, Japan
Venue:
ECIR'07 Proceedings of the 29th European conference on IR research
Year:
2007

Abstract

We address the problem of clustering documents with multiple fields, such as scientific articles with distinct title, abstract, keyword, main-text and reference fields. By taking into account the distinct word distribution of each field, we propose a new probabilistic model, the Field Independent Clustering Model (FICM), for clustering documents with multiple fields. The benefits of FICM come not only from integrating the discriminative ability of each field but also from the ability to select the most suitable component probabilistic model for each field. We examined the performance of FICM on the problem of clustering biomedical documents with three fields (title, abstract and MeSH). From the genomics track data of TREC 2004 and TREC 2005, we randomly generated 60 datasets in which the number of classes ranged from 3 to 12. With an appropriate configuration of generative models for each field, FICM outperformed a classical multinomial model on 59 of the 60 datasets, with 47 of these differences statistically significant at the 95% level. FICM also outperformed a multivariate Bernoulli model on 52 of the 60 datasets, with 36 of these differences statistically significant at the 95% level.
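
The idea of combining a separate generative model per field can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the fields are conditionally independent given the cluster, so that per-field log-likelihoods simply add, and that each field uses either a multinomial or a multivariate Bernoulli component model. The function names, the toy parameters, and the two-cluster example document are all hypothetical.

```python
import numpy as np

def multinomial_loglik(counts, theta):
    """Per-cluster log-likelihood of word counts under a multinomial model.
    counts: (V,) word counts; theta: (K, V) word probabilities per cluster.
    The multinomial coefficient is omitted because it is the same for every cluster."""
    return counts @ np.log(theta).T

def bernoulli_loglik(counts, p):
    """Per-cluster log-likelihood under a multivariate Bernoulli model.
    Only word presence or absence is used; p: (K, V) presence probabilities."""
    present = (counts > 0).astype(float)
    return present @ np.log(p).T + (1.0 - present) @ np.log1p(-p).T

def cluster_posterior(doc, params, models, log_prior):
    """Posterior over clusters for one document.
    Assumption: fields are independent given the cluster, so log-likelihoods add."""
    total = log_prior.copy()
    for field, counts in doc.items():
        total = total + models[field](counts, params[field])
    total -= total.max()          # stabilise before exponentiating
    post = np.exp(total)
    return post / post.sum()

# Hypothetical toy setup: 2 clusters, a "title" field and a "mesh" field.
models = {"title": multinomial_loglik, "mesh": bernoulli_loglik}
params = {
    "title": np.array([[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]),
    "mesh": np.array([[0.9, 0.1], [0.1, 0.9]]),
}
doc = {"title": np.array([2, 1, 0]), "mesh": np.array([1, 0])}
log_prior = np.log(np.array([0.5, 0.5]))

post = cluster_posterior(doc, params, models, log_prior)
print(post)
```

In a full clustering procedure these posteriors would typically serve as the E-step of an EM loop, with the per-field parameters re-estimated in the M-step; that part is omitted here, and the abstract does not state which estimation procedure the paper uses.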