We present a document-specific OCR system and apply it to a corpus of faxed business letters. Unsupervised classification of the segmented character bitmaps on each page, using a "clump" metric, typically yields several hundred clusters with highly skewed populations. Letter identities are then assigned to each cluster by maximizing matches with a lexicon of English words. We found that for 2/3 of the pages, we can identify almost 80% of the words included in the lexicon, without any shape training. Residual errors are caused by mis-segmentation, including missed lines and punctuation. This research differs from earlier attempts to apply cipher decoding to OCR in (1) using real data, (2) using a more appropriate clustering algorithm, and (3) decoding a many-to-many instead of a one-to-one mapping between clusters and letters.
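The lexicon-matching step can be illustrated with a minimal sketch. It is not the system described above: it assumes each word on a page has already been reduced to a tuple of integer cluster IDs, and it uses a small exhaustive backtracking search in place of whatever decoding procedure the authors used. The function name decode_clusters, the toy page, and the toy lexicon are all hypothetical. The sketch lets several clusters share one letter (many-to-one), but it does not let one cluster stand for several letters, so it is a simplification of the many-to-many mapping mentioned in the abstract.

```python
from typing import Dict, List, Optional, Tuple

Word = Tuple[int, ...]  # a word image as a sequence of cluster IDs (assumed input)


def decode_clusters(words: List[Word], lexicon: List[str]) -> Tuple[Dict[int, str], int]:
    """Find a cluster -> letter assignment that maximizes lexicon matches.

    Exhaustive backtracking, so only suitable for toy inputs.
    Returns (mapping, number_of_words_matched).
    """
    # Index the lexicon by word length so only plausible candidates are tried.
    by_len: Dict[int, List[str]] = {}
    for entry in lexicon:
        by_len.setdefault(len(entry), []).append(entry)

    # Longer words constrain the mapping more, so try them first.
    order = sorted(words, key=len, reverse=True)
    best: Tuple[Dict[int, str], int] = ({}, 0)

    def extend(mapping: Dict[int, str], word: Word, cand: str) -> Optional[Dict[int, str]]:
        """Return mapping extended so that word reads as cand, or None on conflict."""
        new = dict(mapping)
        for cluster_id, letter in zip(word, cand):
            if new.setdefault(cluster_id, letter) != letter:
                return None
        return new

    def search(i: int, mapping: Dict[int, str], matched: int) -> None:
        nonlocal best
        # Prune: even matching every remaining word cannot beat the best so far.
        if matched + (len(order) - i) <= best[1]:
            return
        if i == len(order):
            best = (mapping, matched)
            return
        word = order[i]
        for cand in by_len.get(len(word), []):
            new = extend(mapping, word, cand)
            if new is not None:
                search(i + 1, new, matched + 1)
        # Also allow leaving this word unmatched (e.g. it is not in the lexicon).
        search(i + 1, mapping, matched)

    search(0, {}, 0)
    return best


# Toy page: clusters 3 and 6 are two different bitmap clusters of the same letter.
page_words: List[Word] = [(1, 2, 3), (4, 5, 1), (2, 5, 1), (2, 6, 5, 1)]
toy_lexicon = ["the", "cat", "hat", "heat", "and", "dog"]

mapping, matched = decode_clusters(page_words, toy_lexicon)
decoded = ["".join(mapping.get(c, "?") for c in w) for w in page_words]
print(matched, decoded)
```

On a real page with several hundred clusters, an exhaustive search of this kind would be far too slow; the sketch is only meant to show how lexicon constraints alone, with no shape training, can pin down letter identities.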