Hybrid Page Layout Analysis via Tab-Stop Detection

Authors:
Raymond W. Smith
Affiliations:
-
Venue:
ICDAR '09 Proceedings of the 2009 10th International Conference on Document Analysis and Recognition
Year:
2009



Abstract

A new hybrid page layout analysis algorithm is proposed, which uses bottom-up methods to form an initial data-type hypothesis and locate the tab-stops that were used when the page was formatted. The detected tab-stops are used to deduce the column layout of the page. The column layout is then applied in a top-down manner to impose structure and reading order on the detected regions. The complete C++ source code implementation is available as part of the Tesseract open source OCR engine at http://code.google.com/p/tesseract-ocr.
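
The three stages named in the abstract (bottom-up tab-stop detection, column deduction, top-down reading order) can be illustrated with a toy sketch. This is not the Tesseract implementation and does not reproduce the paper's actual method: it assumes text-line bounding boxes are already available, treats a tab-stop as any x-coordinate where several line edges align, and pairs each left tab-stop with the nearest right tab-stop to form a column. The names `Box`, `find_tab_stops`, `deduce_columns`, and `reading_order`, and the `tolerance` and `min_support` thresholds, are hypothetical choices made for illustration.

```python
from collections import namedtuple

# Hypothetical text-line bounding box in pixel coordinates.
Box = namedtuple("Box", "left top right bottom")


def find_tab_stops(edges, tolerance=3, min_support=3):
    """Bottom-up step: cluster x-coordinates of line edges.

    Consecutive sorted values within `tolerance` join the same cluster;
    a cluster with at least `min_support` members yields one tab-stop
    at the cluster mean. Both thresholds are illustrative assumptions.
    """
    stops, cluster = [], []
    for x in sorted(edges):
        if cluster and x - cluster[-1] > tolerance:
            if len(cluster) >= min_support:
                stops.append(sum(cluster) / len(cluster))
            cluster = []
        cluster.append(x)
    if len(cluster) >= min_support:
        stops.append(sum(cluster) / len(cluster))
    return stops


def deduce_columns(boxes, tolerance=3, min_support=3):
    """Pair each left tab-stop with the nearest right tab-stop beyond it."""
    lefts = find_tab_stops([b.left for b in boxes], tolerance, min_support)
    rights = find_tab_stops([b.right for b in boxes], tolerance, min_support)
    columns = []
    for left in lefts:
        candidates = [r for r in rights if r > left]
        if candidates:
            columns.append((left, min(candidates)))
    return columns


def reading_order(boxes, columns):
    """Top-down step: order boxes by column, then top to bottom."""
    def column_index(box):
        center = (box.left + box.right) / 2
        for i, (left, right) in enumerate(columns):
            if left <= center <= right:
                return i
        return len(columns)  # boxes outside every column sort last

    return sorted(boxes, key=lambda b: (column_index(b), b.top))
```

Given six line boxes that form two side-by-side columns, `deduce_columns` returns two `(left, right)` intervals and `reading_order` lists all of the first column's lines before the second's. A real page needs far more than this (images, ragged edges, headings that span columns), which is what the hybrid approach in the paper addresses.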