Automatic localization of page segmentation errors

  • Authors:
  • Dheeraj Mundhra; Anand Mishra; C. V. Jawahar

  • Affiliations:
  • IIT Kharagpur, Kharagpur, India; IIIT Hyderabad, Hyderabad, India; IIIT Hyderabad, Hyderabad, India

  • Venue:
  • Proceedings of the 2011 Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data
  • Year:
  • 2011

Abstract

Page segmentation is a basic step in any character recognition system, and its failure is one of the major causes of the degraded overall accuracy of current Indian language OCR engines. Many segmentation algorithms have been proposed in the literature. These algorithms often fail to adapt dynamically to a given page and thus tend to yield poor segmentation for specific regions or specific pages. Given the ground truth, locating page segmentation errors is a straightforward problem, and the result is useful mainly for comparing segmentation algorithms. In this work, we locate segmentation errors without directly using the ground truth. Such automatic localization of page segmentation errors can be considered a major step towards improving page segmentation. We focus on localizing line-level segmentation errors. We perform experiments on more than 18,000 scanned pages from 109 books in four prominent South Indian languages.