Simple and accurate feature selection for hierarchical categorisation

  • Authors:
  • Wahyu Wibowo;Hugh E. Williams

  • Affiliations:
  • RMIT University, Melbourne, Australia;RMIT University, Melbourne, Australia

  • Venue:
  • Proceedings of the 2002 ACM symposium on Document engineering
  • Year:
  • 2002

Quantified Score

Hi-index 0.00

Visualization

Abstract

Categorisation of digital documents is useful for organisation and retrieval. While document categories can be a set of unstructured category labels, some document categories are hierarchically structured. This paper investigates automatic hierarchical categorisation and, specifically, the role of features in the development of more effective categorisers. We show that a good hierarchical machine learning-based categoriser can be developed using small numbers of features from pre-categorised training documents. Overall, we show that by using a few terms, categorisation accuracy can be improved substantially: unstructured leaf level categorisation can be improved by up to 8.6%, while top-down hierarchical categorisation accuracy can be improved by up to 12%. In addition, unlike other feature selection models --- which typically require different feature selection parameters for categories at different hierarchical levels --- our technique works equally well for all categories in a hierarchical structure. We conclude that, in general, more accurate hierarchical categorisation is possible by using our simple feature selection technique.