We describe efforts to adapt the Tesseract open source OCR engine to multiple scripts and languages. Effort has been concentrated on enabling generic multi-lingual operation, such that negligible customization is required for a new language beyond providing a corpus of text. Although changes were required to various modules, including physical layout analysis and linguistic post-processing, no change was required to the character classifier beyond changing a few limits. The Tesseract classifier has adapted easily to Simplified Chinese. Test results on English, a mixture of European languages, and Russian, taken from a random sample of books, show a reasonably consistent word error rate between 3.72% and 5.78%, and Simplified Chinese has a character error rate of only 3.77%.
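The abstract does not say how the word and character error rates are computed. As a minimal sketch, assuming the common convention of Levenshtein edit distance divided by the length of the ground-truth text, the two metrics could be computed as follows. The function names are illustrative and are not part of Tesseract.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of tokens)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            substitution_cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,                      # deletion from ref
                curr[j - 1] + 1,                  # insertion into ref
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Assumed definition: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference text contains no words")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    """Assumed definition: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference text is empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Whitespace tokenization only makes sense for scripts that separate words with spaces. For a script such as Simplified Chinese, which does not, a character-level metric is the more natural measure, consistent with the abstract reporting a character error rate for that language.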