Semi-supervised learning for text-line detection

Authors:
Zongyi Liu;Hanning Zhou;Ning Yang
Affiliations:
Amazon.com (TCC-1527/J2), 701 Fifth Avenue Suite 1500, Seattle, WA 98104, USA;Amazon.com (TCC-1527/J2), 701 Fifth Avenue Suite 1500, Seattle, WA 98104, USA;Amazon.com (TCC-1527/J2), 701 Fifth Avenue Suite 1500, Seattle, WA 98104, USA
Venue:
Pattern Recognition Letters
Year:
2010

Abstract

Automatic detection of text-lines in document images has long been studied. However, most researchers today focus on boosting the detection rate rather than on noise removal. In this paper, we propose a semi-supervised learning framework that aims to segment Manhattan-layout documents with significant levels of noise. The algorithm consists of three steps: first, an initial segmentation process that uses the seed filling algorithm; second, an iterative grouping process that uses projection profiles to estimate the vertical borders of the page contents; third, removal of noise inside the page contents using online training and classification. We test our algorithm on two databases. The first is the University of Washington (UW)-III database, with 1,600 images of varying input quality, which has been widely used by the Document Analysis Research (DAR) communities to measure the performance of segmentation algorithms. The second is the NILE database, created by sampling from 320 journal pages in East Asian, East European, and Middle Eastern languages. The results show that our framework achieves competitive performance in both page-frame-level segmentation and text-line-level segmentation, and is particularly strong at filtering noise. They also show that our algorithm is more adaptive to language variations.
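The first two steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no parameters or pseudocode, so the function names (`seed_fill_components`, `column_profile`, `content_columns`), the 4-connectivity, the `min_count` threshold, and the toy `page` array are all assumptions made for illustration. The third step (online training and classification) is omitted because the abstract does not describe the features or the classifier.

```python
from collections import deque


def seed_fill_components(image):
    """Group 4-connected foreground pixels (value 1) by seed filling.

    Returns one bounding box per component as (top, left, bottom, right),
    with inclusive coordinates, in row-major order of each component's seed.
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # New seed found: fill outward with a queue (breadth-first).
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and image[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def column_profile(image):
    """Vertical projection profile: foreground pixel count per column."""
    return [sum(column) for column in zip(*image)]


def content_columns(profile, min_count=1):
    """First and last column whose count reaches min_count (assumed threshold).

    Returns None when no column qualifies.
    """
    hits = [i for i, count in enumerate(profile) if count >= min_count]
    return (hits[0], hits[-1]) if hits else None


# Toy binary page: two small blobs plus one isolated margin pixel (noise).
page = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 1, 1, 0],
    [0, 0, 1, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
]

boxes = seed_fill_components(page)
profile = column_profile(page)
borders = content_columns(profile, min_count=2)
```

In this toy example, raising `min_count` from 1 to 2 moves the estimated left border past the isolated margin pixel, which hints at how a projection profile can separate page content from border noise. A real document image would need thresholds tuned to its resolution and noise level.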