Indexing for fast categorisation

  • Authors:
  • Vaughan R. Shanks;Hugh E. Williams;Adam Cannane

  • Affiliations:
  • School of Computer Science and Information Technology, RMIT University, GPO Box 2476V, Melbourne;School of Computer Science and Information Technology, RMIT University, GPO Box 2476V, Melbourne;School of Computer Science and Information Technology, RMIT University, GPO Box 2476V, Melbourne

  • Venue:
  • ACSC '03 Proceedings of the 26th Australasian computer science conference - Volume 16
  • Year:
  • 2003

Abstract

Automatic categorisation is an important technique for the management of large document collections. Categorisation can be used to store or locate documents that satisfy an information need when the need cannot be expressed as a concise list of query terms. Inverted indexes are used in all query-based retrieval systems to allow efficient query processing. In this paper, we propose the application of inverted indexes to categorisation with the aim of developing a fast, scalable, and accurate approach. Specifically, we propose successful variants of inverted indexing to reduce index size: first, quantisation of term-category weights; second, compression of the quantised weights; and, last, storing only those weights that significantly impact the categorisation process. We show that our techniques permit fast, accurate categorisation: index size is reduced by orders of magnitude compared to conventional inverted indexing and the accuracy of categorisation is preserved.
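
The general idea of an inverted index over term-category weights, with quantisation and pruning, can be illustrated with a minimal sketch. This is not the authors' implementation: the uniform quantiser, the 4-bit default, the pruning rule (drop entries below a quantised level), the additive scoring, and all function and parameter names are assumptions made for illustration. The compression step mentioned in the abstract is omitted.

```python
from collections import defaultdict


def quantise(weight, w_min, w_max, bits=4):
    """Map a real weight in [w_min, w_max] to an integer in [0, 2**bits - 1].

    Uniform quantisation is an assumption; the abstract does not specify a scheme.
    """
    levels = (1 << bits) - 1
    if w_max == w_min:
        return levels
    return round((weight - w_min) / (w_max - w_min) * levels)


def build_index(term_category_weights, bits=4, min_level=1):
    """Build term -> [(category, quantised weight)].

    Entries whose quantised weight falls below min_level are not stored,
    a hypothetical stand-in for keeping only high-impact weights.
    """
    all_weights = [w for cats in term_category_weights.values() for w in cats.values()]
    w_min, w_max = min(all_weights), max(all_weights)
    index = {}
    for term, cats in term_category_weights.items():
        postings = []
        for category, weight in cats.items():
            q = quantise(weight, w_min, w_max, bits)
            if q >= min_level:
                postings.append((category, q))
        if postings:
            index[term] = postings
    return index


def categorise(index, document_terms):
    """Sum quantised weights per category and return the best-scoring category."""
    scores = defaultdict(int)
    for term in document_terms:
        for category, q in index.get(term, ()):
            scores[category] += q
    return max(scores, key=scores.get) if scores else None
```

With small integer weights per posting, each entry needs only a few bits, which is what makes a later compression pass worthwhile; the pruning threshold trades index size against how many weak term-category associations still contribute to the score.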