Model-Based Information Extraction Method Tolerant of OCR Errors for Document Images

Venue: ICDAR '01 Proceedings of the Sixth International Conference on Document Analysis and Recognition
Year: 2001

Citing: 0
Cited by: 4

Word Searching in Document Images Using Word Portion Matching
DAS '02 Proceedings of the 5th International Workshop on Document Analysis Systems V

Logical Labeling of Document Images Using Layout Graph Matching with Adaptive Learning
DAS '02 Proceedings of the 5th International Workshop on Document Analysis Systems V

Information Retrieval in Document Image Databases
IEEE Transactions on Knowledge and Data Engineering

A Document Image Retrieval System
Engineering Applications of Artificial Intelligence

Abstract

A new method for information extraction from document images is proposed as the basis for a document reader that can extract required keywords and their logical relationships from various printed documents. The text obtained by running OCR on such documents may contain not only unknown words and compound words but also incorrect words caused by OCR errors. To cope with OCR errors, the proposed method adopts robust keyword matching, which searches for a string pattern in two-dimensional OCR results consisting of a set of possible character candidates. To deal with these difficulties, the keyword matching uses a keyword dictionary that includes word segments and incorrect words exhibiting typical OCR errors. After keyword matching, global document matching is carried out between the keyword matching results for an input document and document models, which consist of keyword models and their logical relationships. This global matching determines the most suitable model for the input document and solves word segmentation problems accurately even if the document contains unknown words, compound words, or incorrect words. Experimental results obtained for 100 documents show that the method is robust and effective for various document structures.
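The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the lattice representation (one set of candidate characters per character position), the names `find_keyword`, `match_keywords`, and `best_model`, the sample dictionary, and the scoring rule are all assumptions made for illustration. In particular, the global matching step is reduced here to keyword coverage and ignores the logical relationships between keywords that the paper's document models include.

```python
from typing import Dict, List, Set, Tuple

# Assumed representation: one set of candidate characters per character position.
Lattice = List[Set[str]]


def find_keyword(lattice: Lattice, pattern: str) -> List[int]:
    """Return start positions where each pattern character is among the candidates."""
    hits = []
    for start in range(len(lattice) - len(pattern) + 1):
        if all(ch in lattice[start + i] for i, ch in enumerate(pattern)):
            hits.append(start)
    return hits


def match_keywords(
    lattice: Lattice, dictionary: Dict[str, List[str]]
) -> Dict[str, List[Tuple[int, str]]]:
    """Match every surface form of every keyword against the lattice.

    `dictionary` maps a canonical keyword to its surface forms: the correct
    spelling plus hypothetical OCR-error variants or word segments.
    """
    results = {}
    for keyword, variants in dictionary.items():
        found = [(pos, v) for v in variants for pos in find_keyword(lattice, v)]
        if found:
            results[keyword] = sorted(found)
    return results


def best_model(matched: Dict[str, list], models: Dict[str, List[str]]) -> str:
    """Pick the model with the highest fraction of its keywords found.

    Simplification: real document models would also check how the matched
    keywords relate to each other.
    """
    def coverage(name: str) -> float:
        keywords = models[name]
        if not keywords:
            return 0.0
        return sum(k in matched for k in keywords) / len(keywords)

    return max(models, key=coverage)


# Hypothetical OCR output for a line reading "Invoice No 7".
lattice = [
    {"I", "l", "1"}, {"n"}, {"v"}, {"o", "0"}, {"i", "l"}, {"c", "e"}, {"e"},
    {" "}, {"N"}, {"o", "0"}, {" "}, {"7"},
]
dictionary = {
    "invoice": ["Invoice", "lnvoice"],
    "number": ["No", "N0"],
    "total": ["Total", "Tota1"],
}
models = {"invoice_form": ["invoice", "number", "total"], "receipt": ["total"]}

matched = match_keywords(lattice, dictionary)
print(matched)
print(best_model(matched, models))
```

Keeping all candidates per position means a keyword can still be found when the top-ranked OCR character is wrong, and listing error variants in the dictionary covers cases where the correct character is missing from the candidates altogether.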