Using Character Shape Coding for Information Retrieval

Authors:
Alan F. Smeaton;A. Lawrence Spitz
Affiliations:
-;-
Venue:
ICDAR '97 Proceedings of the 4th International Conference on Document Analysis and Recognition
Year:
1997

Citing 0
Cited 13

The impact on retrieval effectiveness of skewed frequency distributions
ACM Transactions on Information Systems (TOIS)

Document image retrieval without OCRing using a video scanning system
MULTIMEDIA '00 Proceedings of the 2000 ACM workshops on Multimedia

Information Retrieval from Documents: A Survey
Information Retrieval

Group 4 Compressed Document Matching
DAS '98 Selected Papers from the Third IAPR Workshop on Document Analysis Systems: Theory and Practice

Spotting Where to Read on Pages - Retrieval of Relevant Parts from Page Images
DAS '02 Proceedings of the 5th International Workshop on Document Analysis Systems V

Indexing and retrieval of words in old documents
ICDAR '03 Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 1

A search engine for imaged documents in PDF files
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval

Information Retrieval in Document Image Databases
IEEE Transactions on Knowledge and Data Engineering

Camera-Based Document Image Retrieval as Voting for Partial Signatures of Projective Invariants
ICDAR '05 Proceedings of the Eighth International Conference on Document Analysis and Recognition

Retrieval of machine-printed Latin documents through Word Shape Coding
Pattern Recognition

Robust image based document comparison using attributed relational graphs
SPPRA '08 Proceedings of the Fifth IASTED International Conference on Signal Processing, Pattern Recognition and Applications

A survey of keyword spotting techniques for printed document images
Artificial Intelligence Review

Abstract

In conventional information retrieval the task of finding users' search terms in a document is simple. When the document is not available in machine-readable format, optical character recognition (OCR) can usually be performed. We have developed a technique for performing information retrieval on document images that is accurate enough to be of great utility. The method makes generalisations about the images of characters, classifies them, and agglomerates the resulting character shape codes into word tokens. These tokens are sufficiently specific in their representation of the underlying words to allow reasonable retrieval performance. Using a collection of over 250 Mbytes of document texts and queries with known relevance assessments, we present a series of experiments to determine how various parameters in the retrieval strategy affect retrieval performance, and we obtain surprisingly good results.
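
The idea of mapping characters to coarse shape classes and joining the codes into word tokens can be illustrated with a minimal sketch. It simulates shape coding from plain text rather than from page images. The class letters, the character-to-class mapping, and the function names (shape_code, word_token, build_index, search) are assumptions made for illustration only and are not taken from the paper.

```python
from collections import defaultdict

# Hypothetical shape classes (illustrative assumption, not the paper's table):
#   'A' = tall glyph (capitals, digits, ascenders)
#   'x' = x-height glyph, 'g' = descender
#   'i' = x-height glyph with a dot, 'j' = descender with a dot
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gpqy")


def shape_code(ch: str) -> str:
    """Map one character to its assumed shape class ('' for punctuation)."""
    if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
        return "A"
    if ch in DESCENDERS:
        return "g"
    if ch == "i":
        return "i"
    if ch == "j":
        return "j"
    if ch.islower():
        return "x"
    return ""  # punctuation and other symbols are dropped


def word_token(word: str) -> str:
    """Agglomerate the shape codes of a word into a single token."""
    return "".join(shape_code(c) for c in word)


def build_index(docs: dict) -> dict:
    """Build an inverted index from word shape tokens to document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            token = word_token(word)
            if token:
                index[token].add(doc_id)
    return index


def search(index: dict, query: str) -> set:
    """Return ids of documents sharing at least one shape token with the query."""
    hits = set()
    for word in query.split():
        hits |= index.get(word_token(word), set())
    return hits


docs = {"d1": "image retrieval", "d2": "shape coding"}
index = build_index(docs)
print(word_token("retrieval"))     # shape token for one word
print(search(index, "retrieval"))  # documents matching the query token
```

Because several words can share one token under such a mapping (here "cat" and "oat" both become "xxA"), the tokens are less specific than the words themselves, which is why the specificity of the representation matters for retrieval performance.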