A new hybrid page layout analysis algorithm is proposed. It uses bottom-up methods to form an initial data-type hypothesis and to locate the tab-stops that were used when the page was formatted. The detected tab-stops are used to deduce the column layout of the page. The column layout is then applied in a top-down manner to impose structure and reading order on the detected regions. The complete C++ source code implementation is available as part of the Tesseract open source OCR engine at http://code.google.com/p/tesseract-ocr.
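The hybrid idea can be illustrated with a small sketch. This is not the Tesseract implementation, only a simplified illustration under stated assumptions: text fragments are already available as axis-aligned boxes, a tab-stop is approximated as a cluster of aligned left edges, and each tab-stop is treated as the start of a column. The names `Box`, `find_tab_stops`, and `reading_order`, and the `tolerance` and `min_support` parameters, are hypothetical.

```python
from bisect import bisect_right
from typing import List, NamedTuple


class Box(NamedTuple):
    """Bounding box of a text fragment, in pixels (hypothetical input format)."""
    left: int
    top: int
    right: int
    bottom: int


def find_tab_stops(boxes: List[Box], tolerance: int = 5, min_support: int = 3) -> List[int]:
    """Bottom-up step: cluster left edges that line up vertically.

    A cluster supported by at least `min_support` boxes is kept as a
    candidate tab-stop and represented by its smallest x coordinate.
    """
    clusters: List[List[int]] = []
    for x in sorted(b.left for b in boxes):
        if clusters and x - clusters[-1][-1] <= tolerance:
            clusters[-1].append(x)   # close to the previous edge: same cluster
        else:
            clusters.append([x])     # start a new cluster
    return [min(c) for c in clusters if len(c) >= min_support]


def reading_order(boxes: List[Box], tab_stops: List[int], tolerance: int = 5) -> List[Box]:
    """Top-down step: treat each tab-stop as a column start (an assumption),
    assign every box to a column, then read columns left to right and
    boxes top to bottom within each column.
    """
    def column_of(b: Box) -> int:
        # Index of the rightmost tab-stop at or left of the box's left edge.
        return max(bisect_right(tab_stops, b.left + tolerance) - 1, 0)

    return sorted(boxes, key=lambda b: (column_of(b), b.top))


# Two columns of three lines each, plus one indented line in the left column.
page = [
    Box(200, 0, 380, 15), Box(10, 0, 180, 15),
    Box(201, 20, 380, 35), Box(12, 20, 180, 35),
    Box(202, 40, 380, 55), Box(11, 40, 180, 55),
    Box(30, 60, 180, 75),
]
stops = find_tab_stops(page)
ordered = reading_order(page, stops)
```

In this toy page the single indented line does not gather enough support to become a tab-stop, so it stays in the left column and is read before the right column begins. A real page would also need right-edge tab-stops, image and table regions, and columns that change partway down the page, none of which this sketch attempts.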