Notes and references on early automatic classification work

  • Authors: Karen Sparck Jones

  • Venue: ACM SIGIR Forum
  • Year: 1991

Abstract

This informal note was prompted by discussions and questions at the 1990 AAAI Spring Symposium on Text-Based Intelligent Systems (cf Jacobs 1990). There is a growing interest in access to, and the use of, large scale full-text databases for a variety of purposes, and in the application of classification methods to organise the mass of data involved (see e.g. Church and Hanks 1990). A good deal of work has been done in this field in the past, but it is little known, and some of the early research literature is not very accessible. Classification is an area in which it is easy to make plausible but mistaken assumptions, and as this certainly holds for classification in retrieval, there is a good deal that can be usefully learnt from past experience, most of which was hard won from careful thought and grinding experiment. This paper is intended as an introduction to this initial work on automatic classification, to help those now becoming interested in classification to avoid unnecessarily repeating heavy effort or, more especially, reinventing square wheels. It should also be noted that automatic classification and related (e.g. seriation) methods have been extensively developed for biological applications in particular, but have been more variously applied, and that much of this work may be relevant in the broad area of machine learning.

It must be emphasised that as this paper is focussed on early work on automatic classification, particularly for information retrieval, and is designed primarily to lead into this research and its literature, it does not attempt a critical evaluation of the overall results established by now, or of the current state of the art. However, it should be pointed out that in the retrieval context in general, as opposed to the wider one of classification as a whole, there has been comparatively little work since the seventies, largely for the reasons indicated in the paper. More recent work in any case refers heavily to earlier research, so this note can be taken as an entry point to the research of the last decade, for which some references are given at the end of the note.