Unsupervised Documents Categorization Using New Threshold-Sensitive Weighting Technique

  • Authors:
  • Frederic Ehrler;Patrick Ruch

  • Affiliations:
  • Artificial Intelligence laboratory, University of Geneva, Geneva, Switzerland and Medical Informatics Services, University Hospital of Geneva, Geneva, Switzerland;Medical Informatics Services, University Hospital of Geneva, Geneva, Switzerland

  • Venue:
  • AIME '07 Proceedings of the 11th conference on Artificial Intelligence in Medicine
  • Year:
  • 2007

Abstract

As the number of published documents increases quickly, there is a crucial need for fast and sensitive categorization methods to manage the information produced. In this paper, we focus on the categorization of biomedical documents with concepts of the Gene Ontology, an ontology dedicated to gene description. Our approach discovers associations between the predefined concepts and the documents using string matching techniques. The assignments are ranked according to a score computed using several strategies. The effects of these different scoring strategies on categorization effectiveness are evaluated. In particular, a new weighting technique based on term frequency is presented. This new weighting technique improves categorization effectiveness in most of the experiments performed. This paper shows that a clever use of term frequency can bring substantial benefits when performing automatic categorization on large collections of documents.
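
The abstract does not give the scoring formula, so the following Python sketch only illustrates the general idea of matching concept labels against a document and ranking the matches with a frequency-based, thresholded term weight. Everything specific here is an assumption made for illustration and not the authors' method: the function names (`tokenize`, `term_weights`, `rank_concepts`), the concept identifiers and labels, the `threshold` parameter, and the rule that terms occurring in at least `threshold` concept labels are down-weighted by the inverse of their frequency.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_weights(concepts, threshold):
    """Weight each term by how many concept labels contain it.

    Hypothetical rule (not taken from the paper): a term found in fewer
    than `threshold` labels keeps weight 1.0; a more frequent term is
    down-weighted to 1 / frequency.
    """
    freq = Counter()
    for label in concepts.values():
        freq.update(set(tokenize(label)))
    return {t: (1.0 if n < threshold else 1.0 / n) for t, n in freq.items()}


def rank_concepts(document, concepts, threshold=2):
    """Rank concepts by the weighted share of their label terms found in the document."""
    weights = term_weights(concepts, threshold)
    doc_terms = set(tokenize(document))
    scores = {}
    for concept_id, label in concepts.items():
        terms = set(tokenize(label))
        total = sum(weights[t] for t in terms)
        matched = sum(weights[t] for t in terms if t in doc_terms)
        if matched > 0:
            scores[concept_id] = matched / total
    # Highest score first; ties are broken by concept identifier.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Illustrative concept labels; the identifiers are placeholders, not Gene Ontology IDs.
concepts = {
    "C1": "cell proliferation",
    "C2": "cell death",
    "C3": "DNA repair",
}
document = "The protein regulates DNA repair after cell damage."
ranked = rank_concepts(document, concepts)
```

In this toy setting the frequent term "cell" contributes less than the rarer terms, so a concept matched only through "cell" ranks below one whose whole label occurs in the document. The paper's actual weighting technique may differ substantially from this rule.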