OCR with No Shape Training

Authors:
Affiliations:
Venue: ICPR '00 Proceedings of the International Conference on Pattern Recognition - Volume 4
Year: 2000

Abstract

We present a document-specific OCR system and apply it to a corpus of faxed business letters. Unsupervised classification of the segmented character bitmaps on each page, using a "clump" metric, typically yields several hundred clusters with highly skewed populations. Maximizing matches with a lexicon of English words assigns letter identities to each cluster. We found that for 2/3 of the pages, we can identify almost 80% of the words included in the lexicon, without any shape training. Residual errors are caused by mis-segmentation, including missed lines and punctuation. This research differs from earlier attempts to apply cipher decoding to OCR in (1) using real data, (2) using a more appropriate clustering algorithm, and (3) decoding a many-to-many instead of a one-to-one mapping between clusters and letters.
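
The lexicon-matching step can be illustrated with a small sketch. This is not the paper's algorithm: it assumes the clustering has already been done, represents each word as a tuple of cluster IDs, and uses a simple vote-then-hill-climb heuristic chosen here for illustration. All function names (`consistent_candidates`, `assign_letters`, and so on) and the toy data are hypothetical. As a simplification, each cluster receives exactly one letter, although several clusters may share a letter; the many-to-many decoding described in the abstract is more general.

```python
from collections import Counter, defaultdict
from fractions import Fraction
import string


def consistent_candidates(word, lexicon):
    """Lexicon words of the same length in which every cluster ID
    lines up with a single letter throughout the word."""
    out = []
    for lex in sorted(lexicon):
        if len(lex) != len(word):
            continue
        seen = {}
        if all(seen.setdefault(c, ch) == ch for c, ch in zip(word, lex)):
            out.append(lex)
    return out


def decode(words, mapping):
    """Spell out each cluster-ID word; unknown clusters become '?'."""
    return ["".join(mapping.get(c, "?") for c in word) for word in words]


def lexicon_hits(words, mapping, lexicon):
    """Number of decoded words that appear in the lexicon."""
    return sum(text in lexicon for text in decode(words, mapping))


def assign_letters(words, lexicon, sweeps=3):
    # Step 1 (assumed heuristic): every candidate lexicon word votes for
    # its cluster-to-letter pairs, weighted by 1 / number of candidates.
    votes = defaultdict(Counter)
    for word in words:
        cands = consistent_candidates(word, lexicon)
        for lex in cands:
            for c, ch in zip(word, lex):
                votes[c][ch] += Fraction(1, len(cands))

    clusters = sorted({c for word in words for c in word})
    mapping = {}
    for c in clusters:
        tally = votes.get(c)
        # Highest vote wins; ties go to the alphabetically first letter.
        mapping[c] = min(tally, key=lambda ch: (-tally[ch], ch)) if tally else "?"

    # Step 2: hill-climb, changing one cluster's letter at a time and
    # keeping a change only if it strictly increases lexicon matches.
    best = lexicon_hits(words, mapping, lexicon)
    for _ in range(sweeps):
        improved = False
        for c in clusters:
            for letter in string.ascii_lowercase:
                trial = {**mapping, c: letter}
                hits = lexicon_hits(words, trial, lexicon)
                if hits > best:
                    mapping, best, improved = trial, hits, True
        if not improved:
            break
    return mapping


# Toy data: the letter 't' was split into two clusters (0 and 4).
lexicon = {"the", "that", "hat", "tea", "heat"}
words = [(0, 1, 2), (0, 1, 3, 4), (1, 3, 4), (4, 2, 3), (1, 2, 3, 0)]
mapping = assign_letters(words, lexicon)
print(decode(words, mapping))  # ['the', 'that', 'hat', 'tea', 'heat']
```

On a real page the search space is far larger (several hundred clusters, per the abstract), so a greedy procedure like this one would likely need better initialization and would not be guaranteed to find the assignment with the most lexicon matches.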