Handwritten and Machine Printed Text Separation in Document Images Using the Bag of Visual Words Paradigm

  • Authors:
  • Konstantinos Zagoris;Ioannis Pratikakis;Apostolos Antonacopoulos;Basilis Gatos;Nikos Papamarkos

  • Venue:
  • ICFHR '12 Proceedings of the 2012 International Conference on Frontiers in Handwriting Recognition
  • Year:
  • 2012

Abstract

In many types of documents, ranging from forms to archive documents and annotated books, machine printed and handwritten text may be present in the same document image, which causes significant problems in a digitisation and recognition pipeline. The two types of text must therefore be separated before a different recognition methodology is applied to each. This paper proposes a new approach that identifies and separates handwritten from machine printed text using the Bag of Visual Words (BoVW) paradigm. First, blocks of interest are detected in the document image. For each block, a descriptor is calculated based on the BoVW model. A Support Vector Machine classifier then makes the final characterization of each block as Handwritten, Machine Printed or Noise. The approach shows promising performance under a consistent evaluation methodology that couples meaningful measures with a new dataset.
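
The descriptor step of the abstract can be illustrated with a minimal sketch of a generic BoVW histogram. The abstract does not state which local features, codebook size or normalisation the authors use, so everything here is an assumption made for illustration: the function name `bovw_histogram`, the 2-D toy descriptors, the three-word codebook, Euclidean nearest-codeword assignment and L1 normalisation. Block detection and the Support Vector Machine classifier are left out.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def bovw_histogram(local_descriptors, codebook):
    """Return an L1-normalised histogram of visual-word occurrences.

    Hypothetical helper: each local descriptor is assigned to its nearest
    codeword (hard assignment), and the counts are divided by the number
    of descriptors. This is a generic BoVW encoding, not necessarily the
    one used in the paper.
    """
    counts = [0] * len(codebook)
    for descriptor in local_descriptors:
        # Index of the codeword closest to this descriptor.
        nearest = min(range(len(codebook)),
                      key=lambda k: dist(descriptor, codebook[k]))
        counts[nearest] += 1

    total = sum(counts)
    if total == 0:
        # A block with no local descriptors gets an all-zero histogram.
        return [0.0] * len(codebook)
    return [count / total for count in counts]


# Toy data (assumed): a 3-word codebook and 4 local descriptors from one block.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
block_descriptors = [(0.1, -0.1), (0.9, 1.2), (1.1, 0.8), (4.7, 5.2)]

histogram = bovw_histogram(block_descriptors, codebook)
print(histogram)  # [0.25, 0.5, 0.25]
```

In a full pipeline, one such fixed-length histogram per block would be the feature vector passed to a classifier trained on the three labels named in the abstract (Handwritten, Machine Printed, Noise).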