Using Character Shape Coding for Information Retrieval

  • Authors:
  • Alan F. Smeaton;A. Lawrence Spitz

  • Venue:
  • ICDAR '97 Proceedings of the 4th International Conference on Document Analysis and Recognition
  • Year:
  • 1997

Abstract

In conventional information retrieval, the task of finding users' search terms in a document is simple. When the document is not available in machine-readable form, optical character recognition (OCR) can usually be performed. We have developed a technique for performing information retrieval on document images with accuracy sufficient to be of great utility. The method makes generalisations about the images of characters, classifies them into character shape codes, and agglomerates the resulting codes into word tokens. These tokens are sufficiently specific in their representation of the underlying words to allow reasonable retrieval performance. Using a collection of over 250 Mbytes of document texts and queries with known relevance assessments, we present a series of experiments to determine how various parameters of the retrieval strategy affect retrieval performance, and we obtain surprisingly good results.
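
The idea of coding characters by shape class and retrieving on the resulting word tokens can be illustrated with a minimal sketch. It is not the authors' implementation: the shape classes below (`A` for tall characters, `x` for x-height characters, `g` for descenders, `i` and `j` for dotted characters) are a simplified, hypothetical mapping, and the codes are simulated from plain text, whereas the method in the abstract derives them from character images. The function names `shape_code`, `build_index` and `search` are likewise assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical, simplified shape classes (an assumption for illustration;
# the abstract does not list the actual classes used).
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gpqy")


def shape_code(word):
    """Map each character of a word to a coarse shape class."""
    codes = []
    for ch in word:
        if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
            codes.append("A")  # tall characters
        elif ch in DESCENDERS:
            codes.append("g")  # characters extending below the baseline
        elif ch == "i":
            codes.append("i")  # x-height character with a dot
        elif ch == "j":
            codes.append("j")  # descender with a dot
        elif ch.islower():
            codes.append("x")  # plain x-height characters
        # any other character is ignored
    return "".join(codes)


def tokenize(text):
    """Split text into alphanumeric words."""
    return re.findall(r"[A-Za-z0-9]+", text)


def build_index(docs):
    """Build an inverted index from shape-coded word tokens to document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[shape_code(word)].add(doc_id)
    return index


def search(index, query):
    """Rank documents by the number of query tokens whose shape code they contain."""
    scores = defaultdict(int)
    for word in tokenize(query):
        for doc_id in index.get(shape_code(word), ()):
            scores[doc_id] += 1
    return sorted(scores, key=lambda d: (-scores[d], d))


docs = {
    "d1": "document image retrieval",
    "d2": "the cat sat",
}
index = build_index(docs)
print(search(index, "image"))  # matches d1 via the token "ixxgx"
print(search(index, "mat"))    # matches d2: "mat", "cat" and "sat" share the code "xxA"
```

The second query shows the trade-off the abstract alludes to: shape codes are ambiguous, so distinct words can collapse to the same token, yet longer words tend to stay specific enough to support retrieval.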