Handwritten document retrieval strategies

Authors:
Venu Govindaraju;Huaigu Cao;Anurag Bhardwaj
Affiliations:
University at Buffalo - SUNY, Amherst, NY;BBN Technologies, Cambridge, MA;University at Buffalo - SUNY, Amherst, NY
Venue:
Proceedings of The Third Workshop on Analytics for Noisy Unstructured Text Data
Year:
2009

Abstract

With the continuous growth of the World Wide Web, there is an urgent need for an efficient information retrieval system that can search and retrieve handwritten documents in response to user queries. However, unconstrained handwriting recognition remains a challenging task with inadequate performance (around 30% accuracy), which is a major hurdle to providing a robust search experience for handwritten documents. In this paper, we describe our recent research on information retrieval from the noisy text output by imperfect recognizers applied to handwritten document images. We describe three techniques, each exploring a different approach to the noisy text retrieval task. The first method uses a novel bootstrapping mechanism to refine the OCR'ed text and uses the cleaned text for retrieval. The second method uses the uncorrected, raw OCR'ed text but modifies the standard vector space model to handle noisy text. The third method employs robust image features to index the documents instead of using noisy OCR'ed text. We describe these approaches in detail and present their performance using standard IR evaluation metrics.
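
For orientation, the following is a minimal sketch of the standard vector space model that the second method takes as its starting point: TF-IDF weighting with cosine similarity. It is not the modified model described in the paper, whose details the abstract does not give. The function names (`tokenize`, `build_index`, `cosine`, `rank`), the smoothed IDF formula, and the sample documents with simulated recognition errors are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def build_index(docs):
    """Return per-document TF-IDF vectors and the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    # Assumed IDF variant: log(N / df) + 1, so every indexed term has weight > 0.
    idf = {t: math.log(n_docs / df) + 1.0 for t, df in doc_freq.items()}
    vectors = [
        {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return (score, doc_index) pairs sorted by descending score."""
    vectors, idf = build_index(docs)
    # Query terms that never occur in the collection are dropped.
    q_vec = {
        t: tf * idf[t]
        for t, tf in Counter(tokenize(query)).items()
        if t in idf
    }
    scores = [(cosine(q_vec, v), i) for i, v in enumerate(vectors)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))


# Hypothetical collection: document 1 is document 0 with simulated
# recognition errors ("pat1ent", "rep0rts").
docs = [
    "patient reports chest pain",
    "pat1ent rep0rts chest pain",
    "invoice for office supplies",
]
results = rank("patient chest", docs)
```

In this toy collection the misrecognized copy ranks below the clean one for the query "patient chest", because the corrupted token "pat1ent" no longer matches the query term. Exact term matching of this kind is what makes the unmodified model fragile on noisy recognizer output.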