Field independent probabilistic model for clustering multi-field documents

  • Authors:
  • Shanfeng Zhu;Ichigaku Takigawa;Jia Zeng;Hiroshi Mamitsuka

  • Affiliations:
  • Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai 200433, China and School of Computer Science, Fudan University, 220 Handan Road, Shanghai 200433, China;Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto 611-0011, Japan;Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong;Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto 611-0011, Japan

  • Venue:
  • Information Processing and Management: an International Journal
  • Year:
  • 2009

Quantified Score

Hi-index 0.00

Visualization

Abstract

We propose a new finite mixture model for clustering multiple-field documents, such as scientific literature with distinct fields: title, abstract, keywords, main text and references. This probabilistic model, which we call field independent clustering model (FICM), incorporates the distinct word distributions of each field to integrate the discriminative abilities of each field as well as to select the most suitable component probabilistic model for each field. We evaluated the performance of FICM by applying it to the problem of clustering three-field (title, abstract and MeSH) biomedical documents from TREC 2004 and 2005 Genomics tracks, and two-field (title and abstract) news reports from Reuters-21578. Experimental results showed that FICM outperformed the classical multinomial model and the multivariate Bernoulli model, being at a statistically significant level for all the three collections. These results indicate that FICM outperformed widely-used probabilistic models for document clustering by considering the characteristics of each field. We further showed that the component model, which is consistent with the nature of the corresponding field, achieved a better performance and considering the diversity of model setting also gave a further performance improvement. An extended abstract of parts of the work presented in this paper has appeared in Zhu et al. [Zhu, S., Takigawa, I., Zhang, S., & Mamitsuka, H. (2007). A probabilistic model for clustering text documents with multiple fields. In Proceedings of the 29th European conference on information retrieval, ECIR 2007. Lecture notes in computer science (Vol. 4425, pp. 331-342)].