Model-Based Information Extraction Method Tolerant of OCR Errors for Document Images

  • Authors:
  • Affiliations:
  • Venue:
  • ICDAR '01 Proceedings of the Sixth International Conference on Document Analysis and Recognition
  • Year:
  • 2001

Quantified Score

Hi-index 0.00

Visualization

Abstract

Abstract: A new method for information extraction from document images is proposed in this paper as the basis for a document reader which can extract required keywords and their logical relationship from various printed documents. Such documents obtained from OCR results may have not only unknown words and compound words, but also incorrect words due to OCR errors. To cope with OCR errors, the proposed method adopts robust keyword matching which searches for a string pattern from two dimensional OCR results consisting of a set of possible character candidates. This keyword matching uses a keyword dictionary that includes incorrect words with typical OCR errors and segments of words to deal with the above difficulties. After keyword matching, a global document matching is carried out between keyword matching results in an input document and document models which consist of keyword models and their logical relationship. This global matching determines the most suitable model for the input document and solves word segmentation problems accurately even if the document has unknown words, compound words, or incorrect words. Experimental results obtained for 100 documents show that the method is robust and effective for various document structures.