Handwritten document retrieval strategies

  • Authors:
  • Venu Govindaraju;Huaigu Cao;Anurag Bhardwaj

  • Affiliations:
  • University at Buffalo - SUNY, Amherst, NY;BBN Technologies, Cambridge, MA;University at Buffalo - SUNY, Amherst, NY

  • Venue:
  • Proceedings of The Third Workshop on Analytics for Noisy Unstructured Text Data
  • Year:
  • 2009

Quantified Score

Hi-index 0.00

Visualization

Abstract

With the continuous growth of the World Wide Web, there is an urgent need for an efficient information retrieval system which can search and retrieve handwritten documents when presented with user queries. However, unconstrained handwriting recognition remains to be a challenging task with inadequate performance (around 30%, accuracy) thus proving to be a major hurdle in providing a robust search experience in the domain of handwritten documents. In this paper, we describe our recent research with focus on information retrieval from noisy text output by imperfect recognizers applied to handwritten document images. We describe three techniques each exploring a different approach for solving the noisy text retrieval task. The first method uses a novel bootstrapping mechanism to refine the OCR'ed text and uses the cleaned text for retrieval. The second method uses the uncorrected or raw OCR'ed text but modifies the standard vector space model for handling noisy text issues. The third method employs robust image features to index the documents instead of using noisy OCR'ed text. We describe these approaches in detail and also present their performance using standard IR evaluation metrics.