Bootstrapping for hierarchical document classification

  • Authors:
  • Giordano Adami;Paolo Avesani;Diego Sona

  • Affiliations:
  • ITC-irst, Povo, Italy;ITC-irst, Povo, Italy;ITC-irst, Povo, Italy

  • Venue:
  • CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
  • Year:
  • 2003

Quantified Score

Hi-index 0.00

Visualization

Abstract

Managing the hierarchical organization of data is starting to play a key role in the knowledge management community due to the great amount of human resources needed to create and maintain these organized repositories of information. Machine learning community has in part addressed this problem by developing hierarchical supervised classifiers that help maintainers to categorize new resources within given hierarchies. Although such learning models succeed in exploiting relational knowledge, they are highly demanding in terms of labeled examples, because the number of categories is related to the dimension of the corresponding hierarchy. Hence, the creation of new directories or the modification of existing ones require strong investments.This paper proposes a semi-automatic process (interleaved with human suggestions) whose aim is to minimize (simplify) the work required to the administrators when creating, modifying, and maintaining directories. Within this process, bootstrapping a taxonomy with examples represents a critical factor for the effective exploitation of any supervised learning model. For this reason we propose a method for the bootstrapping process that makes a first hypothesis of categorization for a set of unlabeled documents, with respect to a given empty hierarchy of concepts. Based on a revision of Self-Organizing Maps, namely TaxSOM, the proposed model performs an unsupervised classification, exploiting the a-priori knowledge encoded in a taxonomy structure both at the terminological and topological level. The ultimate goal of TaxSOM is to create the premise for successfully training a supervised classifier.