Evaluating SEE - A Benchmarking System for Document Page Segmentation

  • Authors:
  • Stefan Agne;Andreas Dengel;Bertin Klein

  • Venue:
  • ICDAR '03 Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 1
  • Year:
  • 2003

Abstract

The decomposition of a document into segments such as text regions and graphics is a significant part of the document analysis process. The basic requirement for rating and improvement of page segmentation algorithms is systematic evaluation. The approaches known from the literature have the disadvantage that manually generated reference data (zoning ground truth) are needed for the evaluation task. The effort and cost of the creation of these data are very high. This paper describes the evaluation system SEE and presents an assessment of its quality. The system requires the OCR generated text and the original text of the document in correct reading order (text ground truth) as input. No manually generated zoning ground truth is needed. The implicit structure information that is contained in the text ground truth is used for the evaluation of the automatic zoning. Therefore, an assignment of the corresponding text regions in the text ground truth and those in the OCR generated text (matches) is sought. A fault tolerant string matching algorithm underlies a method able to tolerate OCR errors in the text. The segmentation errors are determined as a result of the evaluation of the matching. Subsequently, the edit operations which are necessary for the correction of the recognized segmentation errors are computed to estimate the correction costs. Furthermore, SEE provides a version of the OCR generated text that is corrected from the detected page segmentation errors.
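
The matching idea in the abstract can be illustrated with a minimal sketch. It is not SEE's algorithm, which the abstract does not specify: Python's standard `difflib.SequenceMatcher` stands in for the fault tolerant string matching, and the function names, the 0.6 similarity threshold, and the simple reading-order check are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def best_match(gt_region, ocr_regions, threshold=0.6):
    """Return the index of the OCR region most similar to gt_region.

    Returns None when no region reaches the (assumed) similarity threshold.
    """
    best_index, best_ratio = None, 0.0
    for i, region in enumerate(ocr_regions):
        # ratio() is 1.0 for identical strings and drops as OCR errors accumulate
        ratio = SequenceMatcher(None, gt_region, region, autojunk=False).ratio()
        if ratio > best_ratio:
            best_index, best_ratio = i, ratio
    return best_index if best_ratio >= threshold else None


def match_regions(gt_regions, ocr_regions, threshold=0.6):
    """Assign each ground-truth region (in reading order) to an OCR region."""
    return [best_match(g, ocr_regions, threshold) for g in gt_regions]


def count_order_errors(matches):
    """Count adjacent matched regions that appear out of order in the OCR text."""
    found = [m for m in matches if m is not None]
    return sum(1 for a, b in zip(found, found[1:]) if b < a)


# Hypothetical data: the OCR output contains character errors and has
# swapped the first two regions relative to the correct reading order.
ground_truth = [
    "The quick brown fox jumps over the lazy dog.",
    "Page segmentation splits a page into zones.",
    "Evaluation needs reference data.",
]
ocr_output = [
    "Page segmentatlon splits a page lnto zones.",
    "The qu1ck brown fox jumps ovcr the lazy dog.",
    "Evaluation needs reference data.",
]

matches = match_regions(ground_truth, ocr_output)
order_errors = count_order_errors(matches)
```

Here the character-level OCR errors do not prevent the regions from being matched, and the swapped regions show up as a single reading-order violation. A fuller treatment would also have to handle regions that were split or merged by the segmentation, which this sketch ignores.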