Using a boosted tree classifier for text segmentation in hand-annotated documents

  • Authors:
  • Xujun Peng; Srirangaraj Setlur; Venu Govindaraju; Sitaram Ramachandrula

  • Affiliations:
  • Center for Unified Biometrics and Sensors (CUBS), Department of Computer Science and Engineering, University at Buffalo, Amherst, NY, USA; Center for Unified Biometrics and Sensors (CUBS), Department of Computer Science and Engineering, University at Buffalo, Amherst, NY, USA; Center for Unified Biometrics and Sensors (CUBS), Department of Computer Science and Engineering, University at Buffalo, Amherst, NY, USA; HP Labs India, Hosur Main Road, Adugodi, Bangalore, India

  • Venue:
  • Pattern Recognition Letters
  • Year:
  • 2012

Abstract

A boosted tree classifier is proposed to segment machine-printed text, handwritten text, and overlapping text in documents with handwritten annotations. Each node of the tree-structured classifier is a binary weak learner. Unlike a standard decision tree (DT), which considers only a subset of the training data at each node and is susceptible to over-fitting, the proposed tree is boosted using all available training data at each node, with different weights. The method is evaluated on a set of machine-printed documents annotated by multiple writers in an office/collaborative environment. Experimental results show that the proposed algorithm outperforms other methods on an imbalanced data set.
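
The contrast with a standard decision tree can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not give the weighting rule, so the sketch assumes one plausible scheme in which samples routed to the other branch are down-weighted by a hypothetical factor `SOFT` instead of being discarded. The weak learner (a weighted threshold stump), the function names `fit_stump`, `build`, `fit`, and `predict`, and the two-feature toy data are likewise assumptions made for illustration.

```python
import numpy as np

SOFT = 0.1  # assumed down-weighting factor for samples sent to the other branch


def fit_stump(X, y_idx, w, k):
    """Binary weak learner: pick the (feature, threshold) whose two sides
    are purest by weight. Expects weights w that sum to 1."""
    best_err, best = np.inf, None
    for f in range(X.shape[1]):
        # Skipping the largest value keeps both sides of the split non-empty.
        for thr in np.unique(X[:, f])[:-1]:
            right = X[:, f] > thr
            hit = (np.bincount(y_idx[~right], weights=w[~right], minlength=k).max()
                   + np.bincount(y_idx[right], weights=w[right], minlength=k).max())
            if 1.0 - hit < best_err:
                best_err, best = 1.0 - hit, (f, thr)
    return best


def build(X, y_idx, w, k, depth):
    """Grow one node. Every node sees ALL samples; only the weights differ."""
    w = w / w.sum()
    mass = np.bincount(y_idx, weights=w, minlength=k)
    node = {"label": int(mass.argmax())}  # weighted-majority class at this node
    split = fit_stump(X, y_idx, w, k) if depth > 0 else None
    if split is None:
        return node
    f, thr = split
    right = X[:, f] > thr
    # A standard decision tree would pass only X[~right] / X[right] down.
    # Here the full set is passed, with the "other" side down-weighted.
    node.update(
        feature=f, thr=thr,
        left=build(X, y_idx, w * np.where(right, SOFT, 1.0), k, depth - 1),
        right=build(X, y_idx, w * np.where(right, 1.0, SOFT), k, depth - 1),
    )
    return node


def fit(X, y, depth=2):
    X = np.asarray(X, dtype=float)
    classes, y_idx = np.unique(np.asarray(y), return_inverse=True)
    w = np.full(len(X), 1.0 / len(X))  # start from uniform weights
    return build(X, y_idx, w, len(classes), depth), classes


def predict(tree, classes, X):
    out = []
    for x in np.asarray(X, dtype=float):
        node = tree
        while "feature" in node:
            node = node["right"] if x[node["feature"]] > node["thr"] else node["left"]
        out.append(classes[node["label"]])
    return np.array(out)


if __name__ == "__main__":
    # Toy features (hypothetical), one row per text region.
    X = [[0, 0], [1, 1], [5, 0], [6, 1], [10, 0], [11, 1]]
    y = ["printed", "printed", "handwritten", "handwritten", "overlap", "overlap"]
    tree, classes = fit(X, y, depth=2)
    print(predict(tree, classes, [[5.5, 0.0], [10.5, 1.0]]))
```

Because no sample is ever dropped, the class statistics at deep nodes are still estimated from the full training set, which is the property the abstract credits for reducing over-fitting. A real system would replace the toy features with region descriptors and would likely use a stronger weak learner and a principled weight update.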