Adapting the Tesseract open source OCR engine for multilingual OCR

Authors:
Ray Smith; Daria Antonova; Dar-Shyang Lee
Affiliations:
Google Inc., Mountain View, CA; Google Inc., Mountain View, CA; Google Inc., Mountain View, CA
Venue:
Proceedings of the International Workshop on Multilingual OCR
Year:
2009


Abstract

We describe efforts to adapt the Tesseract open source OCR engine to multiple scripts and languages. Effort has concentrated on enabling generic multilingual operation, so that a new language requires negligible customization beyond providing a corpus of text. Although changes were required to various modules, including physical layout analysis and linguistic post-processing, the character classifier needed no changes beyond adjusting a few limits. The Tesseract classifier has adapted easily to Simplified Chinese. Test results on English, a mixture of European languages, and Russian, taken from a random sample of books, show a reasonably consistent word error rate between 3.72% and 5.78%, and Simplified Chinese has a character error rate of only 3.77%.
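
The abstract reports word and character error rates but does not say how they were computed. As a rough illustration only, the sketch below assumes the common edit-distance definition: the number of substitutions, insertions, and deletions needed to turn the OCR output into the ground truth, divided by the length of the ground truth. The function names are hypothetical and are not part of Tesseract, and the paper's actual scoring procedure may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from the reference
                            curr[j - 1] + 1,      # insert h into the reference
                            prev[j - 1] + cost))  # substitute, or match if equal
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalized by the reference length (assumed definition)."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def word_error_rate(reference, hypothesis):
    # Words are taken to be whitespace-separated tokens (an assumption).
    return error_rate(reference.split(), hypothesis.split())


def char_error_rate(reference, hypothesis):
    # Suits scripts such as Chinese, where whitespace does not delimit words.
    return error_rate(list(reference), list(hypothesis))
```

Under this definition, a word-level rate suits space-delimited languages such as English or Russian, while a character-level rate is the natural choice for Simplified Chinese, which matches the split of metrics in the abstract.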