Unconstrained handwritten document retrieval

  • Authors:
  • Huaigu Cao;Venu Govindaraju;Anurag Bhardwaj

  • Affiliations:
  • Raytheon BBN Technologies, 02138, Cambridge, MA, USA;University at Buffalo, Department of Computer Science and Engineering, 14260, Amherst, NY, USA;University at Buffalo, Department of Computer Science and Engineering, 14260, Amherst, NY, USA

  • Venue:
  • International Journal on Document Analysis and Recognition - Special issue on noisy text analytics
  • Year:
  • 2011

Abstract

With the ever-increasing growth of the World Wide Web, there is an urgent need for an efficient information retrieval system that can search and retrieve handwritten documents in response to user queries. However, unconstrained handwriting recognition remains a challenging task whose inadequate performance is a major hurdle to providing a robust search experience over handwritten documents. In this paper, we describe our recent research on information retrieval from noisy text derived from imperfect handwriting recognizers. First, we describe a novel term frequency estimation technique that incorporates word segmentation information into the retrieval framework to improve overall system performance. Second, we outline a taxonomy of techniques for addressing the noisy text retrieval task. The first method uses a novel bootstrapping mechanism to refine the OCR’ed text and uses the cleaned text for retrieval. The second method uses the uncorrected (raw) OCR’ed text but modifies the standard vector space model to handle noisy text. The third method indexes the documents with robust image features instead of noisy OCR’ed text. We describe these techniques in detail and discuss their performance using standard IR evaluation metrics.
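
To make the second method concrete, the following is a minimal sketch of one generic way a vector space model can be adapted to raw recognizer output. It is not the authors' method. It assumes that each word image comes with a short list of candidate transcriptions and probabilities, and it replaces integer term counts with expected counts before TF-IDF weighting and cosine ranking. The function names, the input layout, and the smoothed IDF formula are all illustrative assumptions, and the sketch does not model the word segmentation information the abstract mentions.

```python
import math
from collections import defaultdict


def expected_term_frequencies(word_hypotheses):
    """Sum candidate probabilities per term (assumed input layout).

    word_hypotheses: one entry per word image, each a list of
    (candidate_term, probability) pairs from a recognizer.
    """
    tf = defaultdict(float)
    for candidates in word_hypotheses:
        for term, prob in candidates:
            tf[term] += prob  # fractional count instead of a hard 0/1 vote
    return dict(tf)


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query_terms, docs):
    """Rank documents by cosine similarity to the query.

    docs: {doc_id: word_hypotheses}. Returns [(doc_id, score), ...],
    best match first.
    """
    tfs = {d: expected_term_frequencies(h) for d, h in docs.items()}
    n_docs = len(tfs)

    # Document frequency: documents where the term receives any probability mass.
    df = defaultdict(int)
    for tf in tfs.values():
        for term in tf:
            df[term] += 1

    # Smoothed IDF (an assumption; other weightings would work as well).
    idf = {t: math.log((n_docs + 1) / (c + 1)) + 1.0 for t, c in df.items()}

    def weight(tf):
        return {t: w * idf.get(t, 0.0) for t, w in tf.items()}

    query_tf = defaultdict(float)
    for term in query_terms:
        query_tf[term] += 1.0
    query_vec = weight(query_tf)

    scores = [(d, cosine(query_vec, weight(tf))) for d, tf in tfs.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

With this layout, a document whose recognizer assigns probability 0.6 to "budget" for some word image still contributes to a query for "budget" even when the top-ranked transcription elsewhere is wrong, which is the intuition behind retrieving from uncorrected text.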