Using a boosted tree classifier for text segmentation in hand-annotated documents

Authors:
Xujun Peng;Srirangaraj Setlur;Venu Govindaraju;Sitaram Ramachandrula
Affiliations:
Center for Unified Biometrics and Sensors (CUBS), Department of Computer Science and Engineering, University at Buffalo, Amherst, NY, USA;Center for Unified Biometrics and Sensors (CUBS), Department of Computer Science and Engineering, University at Buffalo, Amherst, NY, USA;Center for Unified Biometrics and Sensors (CUBS), Department of Computer Science and Engineering, University at Buffalo, Amherst, NY, USA;HP Labs India, Hosur Main Road, Adugodi, Bangalore, India
Venue:
Pattern Recognition Letters
Year:
2012

Abstract

A boosted tree classifier is proposed to segment machine-printed, handwritten, and overlapping text in documents with handwritten annotations. Each node of the tree-structured classifier is a binary weak learner. Unlike a standard decision tree (DT), which considers only a subset of the training data at each node and is susceptible to over-fitting, the proposed tree is boosted using all available training data at each node, with different weights. The method is evaluated on a set of machine-printed documents that have been annotated by multiple writers in an office/collaborative environment. The experimental results show that the proposed algorithm outperforms other methods on an imbalanced data set.
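
The abstract does not give the training procedure, so the following is only a minimal sketch of the general idea (a tree whose nodes are weighted binary weak learners, each trained on every sample), not the authors' algorithm. Everything here is an assumption made for illustration: the decision-stump weak learner, the AdaBoost-style weight update, the hypothetical `soft` factor that down-weights samples routed to the other branch instead of discarding them, and the toy one-dimensional data.

```python
import math

def stump_predict(stump, x):
    """A stump predicts `polarity` when x[feature] > threshold, else -polarity."""
    feature, threshold, polarity = stump
    return polarity if x[feature] > threshold else -polarity

def fit_stump(X, y, w):
    """Return the stump with the lowest weighted error, and that error."""
    best, best_err = None, float("inf")
    for j in range(len(X[0])):
        values = sorted({x[j] for x in X})
        # A threshold below the minimum gives a constant prediction.
        thresholds = [values[0] - 1.0]
        thresholds += [(a + b) / 2.0 for a, b in zip(values, values[1:])]
        for t in thresholds:
            for p in (1, -1):
                err = sum(wi for x, yi, wi in zip(X, y, w)
                          if stump_predict((j, t, p), x) != yi)
                if err < best_err:
                    best, best_err = (j, t, p), err
    return best, best_err

def build(X, y, w, depth, soft=0.1):
    """Grow a tree of stumps; every node sees all samples, reweighted.

    Assumed scheme (not taken from the paper): misclassified samples are
    up-weighted as in AdaBoost, and samples the parent routes to the other
    branch are multiplied by `soft` rather than removed.
    """
    total = sum(w)
    w = [wi / total for wi in w]
    stump, err = fit_stump(X, y, w)
    node = {"stump": stump, "children": {}}
    if depth <= 1 or err <= 0.0:
        return node  # terminal node: its stump output is the final label
    alpha = 0.5 * math.log((1.0 - err) / err)
    preds = [stump_predict(stump, x) for x in X]
    for side in (-1, 1):
        child_w = [wi * math.exp(-alpha * yi * pi) * (1.0 if pi == side else soft)
                   for wi, yi, pi in zip(w, y, preds)]
        node["children"][side] = build(X, y, child_w, depth - 1, soft)
    return node

def predict(node, x):
    """Follow the stump outputs down the tree until a terminal node."""
    out = stump_predict(node["stump"], x)
    child = node["children"].get(out)
    return out if child is None else predict(child, x)

# Toy, mildly imbalanced data: the positive class is an interval,
# which no single stump can separate.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0]]
y = [-1, -1, -1, 1, 1, 1, -1, -1]
tree = build(X, y, [1.0] * len(X), depth=2)
predictions = [predict(tree, x) for x in X]
```

In this sketch a child node still sees the samples its parent routed elsewhere, only with reduced weight, so no node is fitted on a shrinking subset as in a standard decision tree. How closely this mirrors the weighting scheme used in the paper cannot be determined from the abstract alone.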