Field independent probabilistic model for clustering multi-field documents

Authors:
Shanfeng Zhu; Ichigaku Takigawa; Jia Zeng; Hiroshi Mamitsuka
Affiliations:
Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai 200433, China and School of Computer Science, Fudan University, 220 Handan Road, Shanghai 200433, China; Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto 611-0011, Japan; Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong; Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto 611-0011, Japan
Venue:
Information Processing and Management: an International Journal
Year:
2009

Abstract

We propose a new finite mixture model for clustering multiple-field documents, such as scientific literature with distinct fields: title, abstract, keywords, main text and references. This probabilistic model, which we call the field independent clustering model (FICM), models the word distribution of each field separately, so that it can combine the discriminative power of the individual fields and select the most suitable component probabilistic model for each one. We evaluated the performance of FICM by applying it to the problem of clustering three-field (title, abstract and MeSH) biomedical documents from the TREC 2004 and 2005 Genomics tracks, and two-field (title and abstract) news reports from Reuters-21578. Experimental results showed that FICM outperformed the classical multinomial model and the multivariate Bernoulli model, with statistically significant differences on all three collections. These results indicate that, by considering the characteristics of each field, FICM outperforms widely used probabilistic models for document clustering. We further showed that a component model consistent with the nature of its field achieved better performance, and that considering diverse model settings gave a further improvement. An extended abstract of parts of the work presented in this paper has appeared in Zhu et al. [Zhu, S., Takigawa, I., Zhang, S., & Mamitsuka, H. (2007). A probabilistic model for clustering text documents with multiple fields. In Proceedings of the 29th European conference on information retrieval, ECIR 2007. Lecture notes in computer science (Vol. 4425, pp. 331-342)].
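
The abstract does not state FICM's equations, so the following is only an illustrative sketch of one plausible reading: the fields of a document are treated as conditionally independent given the cluster, and each field may use its own component model. The function names, the toy vocabulary, the parameter values, and the choice of a multivariate Bernoulli model for titles and a multinomial model for abstracts are all assumptions made for illustration, not the authors' implementation.

```python
import math

def multinomial_loglik(counts, word_probs):
    """Log-likelihood of word counts under a multinomial.

    The multinomial coefficient is dropped because it does not depend
    on the cluster and cancels in the posterior.
    """
    return sum(c * math.log(word_probs[w]) for w, c in counts.items())

def bernoulli_loglik(counts, presence_probs):
    """Log-likelihood of word presence/absence under a multivariate Bernoulli."""
    total = 0.0
    for word, p in presence_probs.items():
        present = counts.get(word, 0) > 0
        total += math.log(p) if present else math.log(1.0 - p)
    return total

def cluster_posterior(doc, priors, field_models):
    """Posterior over clusters, assuming fields are independent given the cluster.

    doc:          {field: {word: count}}
    priors:       list of cluster prior probabilities
    field_models: {field: (loglik_function, [parameters per cluster])}
    """
    log_joint = []
    for k, prior in enumerate(priors):
        score = math.log(prior)
        # Field independence: per-field log-likelihoods simply add up.
        for field, (loglik, params) in field_models.items():
            score += loglik(doc[field], params[k])
        log_joint.append(score)
    # Normalise in log space for numerical stability.
    top = max(log_joint)
    weights = [math.exp(s - top) for s in log_joint]
    norm = sum(weights)
    return [w / norm for w in weights]

# Hypothetical two-cluster, two-field toy setting (all numbers are made up).
field_models = {
    "title": (bernoulli_loglik, [
        {"gene": 0.8, "market": 0.1},
        {"gene": 0.1, "market": 0.8},
    ]),
    "abstract": (multinomial_loglik, [
        {"gene": 0.7, "market": 0.3},
        {"gene": 0.2, "market": 0.8},
    ]),
}
doc = {"title": {"gene": 1}, "abstract": {"gene": 3, "market": 1}}
posterior = cluster_posterior(doc, [0.5, 0.5], field_models)
print(posterior)
```

In a full clustering procedure this posterior would typically serve as the E-step of an EM loop, with the per-field parameters re-estimated in the M-step; that loop is omitted here to keep the sketch short.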