Robust Recognition of Documents by Fusing Results of Word Clusters

Authors:
Venkat Rasagna;Anand Kumar;C. V. Jawahar;R. Manmatha
Venue:
ICDAR '09 Proceedings of the 2009 10th International Conference on Document Analysis and Recognition
Year:
2009

Abstract

The word error rate of any optical character recognition (OCR) system is usually substantially higher than its component or character error rate, because a single misrecognized component makes the whole word wrong. This is especially true of Indic languages, in which a word consists of many components. Current OCRs recognize each character or word separately and do not take advantage of document-level constraints. We propose a document-level OCR which incorporates information from the entire document to reduce word error rates. Word images are first clustered using a locality sensitive hashing technique. Individual words are then recognized using a (regular) OCR. The OCR outputs of word images in a cluster are then corrected probabilistically by comparing them with the OCR outputs of other members of the same cluster. The approach may be applied to improve the accuracy of any OCR run on documents in any language. In particular, we demonstrate it for Telugu, where the use of language models for post-processing is not promising. We show a relative improvement of 28% for long words and 12% for all words which appear at least twice in the corpus.
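
The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`word_error_rate`, `fuse_cluster`, `correct_document`) are hypothetical, the clusters are assumed to be already available (the hashing step is not shown), and the paper's probabilistic correction is replaced here by a plain per-position majority vote among outputs of the most common length. The first function assumes that components are misrecognized independently, which is a simplification.

```python
from collections import Counter


def word_error_rate(component_error: float, n_components: int) -> float:
    """Word error rate under the ASSUMPTION that each of the word's
    components is misrecognized independently with the same probability."""
    return 1.0 - (1.0 - component_error) ** n_components


def fuse_cluster(ocr_outputs: list[str]) -> str:
    """Return a consensus string for one cluster of word images.

    Stand-in for the paper's probabilistic correction: a per-position
    majority vote, restricted to outputs of the most common length so
    that positions line up without any alignment step.
    """
    if not ocr_outputs:
        raise ValueError("cluster is empty")
    # Vote on the string length first.
    target_len = Counter(len(s) for s in ocr_outputs).most_common(1)[0][0]
    candidates = [s for s in ocr_outputs if len(s) == target_len]
    # Then vote on the symbol at each position.
    return "".join(
        Counter(symbols).most_common(1)[0][0] for symbols in zip(*candidates)
    )


def correct_document(clusters: dict[int, list[str]]) -> dict[int, str]:
    """Map each (hypothetical) cluster id to the fused OCR output."""
    return {cid: fuse_cluster(outputs) for cid, outputs in clusters.items()}


if __name__ == "__main__":
    # A 5% component error rate on a 10-component word gives about 40%.
    print(round(word_error_rate(0.05, 10), 3))
    # Four noisy OCR readings of the same word, each wrong in one place.
    noisy = ["recognition", "recoqnition", "rec0gnition", "recognitiom"]
    print(fuse_cluster(noisy))
```

Each reading in the example is wrong at a different position, so the vote recovers the intended word. A word that occurs only once in the document forms a singleton cluster and is returned unchanged, which is consistent with the abstract reporting gains only for words that appear at least twice.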