OCR with No Shape Training

  • Authors:
  • Affiliations:
  • Venue: ICPR '00 Proceedings of the International Conference on Pattern Recognition - Volume 4
  • Year: 2000

Abstract

We present a document-specific OCR system and apply it to a corpus of faxed business letters. Unsupervised classification of the segmented character bitmaps on each page, using a "clump" metric, typically yields several hundred clusters with highly skewed populations. Maximizing matches with a lexicon of English words assigns letter identities to each cluster. We found that for two-thirds of the pages, we can identify almost 80% of the words included in the lexicon without any shape training. Residual errors are caused by mis-segmentation, including missed lines and punctuation. This research differs from earlier attempts to apply cipher decoding to OCR in (1) using real data, (2) using a more appropriate clustering algorithm, and (3) decoding a many-to-many instead of a one-to-one mapping between clusters and letters.
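
The lexicon-matching step can be illustrated with a toy sketch. The Python below is not the authors' algorithm; it is a minimal backtracking search, written under the assumption that each word on the page is already given as a tuple of cluster IDs. The names `consistent` and `decode`, the cluster IDs, and the four-word lexicon are all hypothetical. The sketch lets several clusters share one letter (many-to-one), which is a simplification of the many-to-many mapping described in the abstract, and it does not model segmentation errors.

```python
from typing import Dict, List, Optional, Tuple

Word = Tuple[int, ...]  # a word as a sequence of cluster IDs (assumed input format)


def consistent(word: Word, cand: str, mapping: Dict[int, str]) -> Optional[Dict[int, str]]:
    """Return the mapping extended so that `word` reads as `cand`, or None on conflict."""
    if len(word) != len(cand):
        return None
    extended = dict(mapping)
    for cluster_id, letter in zip(word, cand):
        # setdefault keeps an existing label, so a mismatch signals a conflict.
        if extended.setdefault(cluster_id, letter) != letter:
            return None
    return extended


def decode(words: List[Word], lexicon: List[str]) -> Tuple[Dict[int, str], int]:
    """Search for a cluster-to-letter labeling that maximizes lexicon matches."""
    best = {"score": -1, "mapping": {}}
    order = sorted(words, key=len, reverse=True)  # longer words constrain more

    def search(i: int, mapping: Dict[int, str], score: int) -> None:
        # Prune: matching every remaining word still cannot beat the best so far.
        if score + (len(order) - i) <= best["score"]:
            return
        if i == len(order):
            best["score"], best["mapping"] = score, mapping
            return
        for cand in lexicon:
            extended = consistent(order[i], cand, mapping)
            if extended is not None:
                search(i + 1, extended, score + 1)
        search(i + 1, mapping, score)  # leave this word unmatched

    search(0, {}, 0)
    return best["mapping"], best["score"]


# Toy page: clusters 3 and 4 are two different shapes of the same letter.
lexicon = ["the", "then", "hen", "ten"]
words = [(1, 2, 3), (1, 2, 4, 5), (2, 3, 5), (1, 4, 5)]
mapping, score = decode(words, lexicon)
```

On this toy input every word can be matched, and clusters 3 and 4 both receive the label `e`. The search is exponential in the worst case, so a page with several hundred clusters would need a more scalable strategy than this sketch.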