The decomposition of a document into segments such as text regions and graphics is a significant part of the document analysis process. The basic requirement for rating and improving page segmentation algorithms is systematic evaluation. The approaches known from the literature have the disadvantage that manually generated reference data (zoning ground truth) are needed for the evaluation task. The effort and cost of creating these data are very high. This paper describes the evaluation system SEE and presents an assessment of its quality. The system requires as input the OCR-generated text and the original text of the document in correct reading order (text ground truth). No manually generated zoning ground truth is needed. The implicit structure information contained in the text ground truth is used to evaluate the automatic zoning. To this end, an assignment (matches) is sought between the text regions in the text ground truth and the corresponding regions in the OCR-generated text. The matching method is built on a fault-tolerant string matching algorithm and can therefore tolerate OCR errors in the text. The segmentation errors are determined by evaluating the matches. Subsequently, the edit operations needed to correct the detected segmentation errors are computed to estimate the correction costs. Furthermore, SEE provides a version of the OCR-generated text in which the detected page segmentation errors are corrected.
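The matching idea can be illustrated with a minimal sketch. It is not the SEE algorithm, whose details the abstract does not give. It assumes a standard approximate substring search (Sellers' dynamic program over edit distance) as a stand-in for the fault-tolerant string matching. The names `approx_find`, `locate_zones`, `order_errors`, and the `max_error_rate` threshold are all hypothetical. Each OCR zone is located in the text ground truth despite character errors, and zones whose matched position runs backwards are flagged as possible reading-order errors.

```python
def approx_find(pattern, text):
    """Best approximate occurrence of `pattern` inside `text`.

    Returns (edit_distance, end), where `end` is the exclusive end index
    in `text` of the closest-matching substring (Sellers' algorithm).
    """
    # Row 0 is all zeros: a match may start anywhere in `text` at no cost.
    prev = [0] * (len(text) + 1)
    for i, pc in enumerate(pattern, 1):
        cur = [i]  # matching i pattern chars against nothing costs i
        for j, tc in enumerate(text, 1):
            cur.append(min(prev[j] + 1,                # pattern char missing in text
                           cur[j - 1] + 1,             # extra char in text
                           prev[j - 1] + (pc != tc)))  # match or substitution
        prev = cur
    end = min(range(len(text) + 1), key=prev.__getitem__)
    return prev[end], end


def locate_zones(ocr_zones, ground_truth, max_error_rate=0.3):
    """Map each OCR zone to (end_position_or_None, distance) in the ground truth.

    `max_error_rate` is an assumed tolerance: a zone counts as matched only
    if its edit distance is at most that fraction of its length.
    """
    matches = []
    for zone in ocr_zones:
        dist, end = approx_find(zone, ground_truth)
        matched = dist <= max_error_rate * len(zone)
        matches.append((end if matched else None, dist))
    return matches


def order_errors(matches):
    """Indices of matched zones that appear before an earlier zone's position."""
    errors, last = [], -1
    for idx, (end, _dist) in enumerate(matches):
        if end is None:
            continue  # unmatched zones are skipped, not counted as order errors
        if end < last:
            errors.append(idx)
        last = max(last, end)
    return errors


# Illustrative data (invented for this sketch): two zones emitted in the
# wrong order, each with one OCR character error.
ground_truth = ("The quick brown fox jumps over the lazy dog. "
                "Page segmentation splits a page into zones.")
ocr_zones = ["Page segmentaton splits a page into zones.",
             "The qu1ck brown fox jumps over the lazy dog."]

matches = locate_zones(ocr_zones, ground_truth)
misordered = order_errors(matches)
```

The search runs in time proportional to the zone length times the ground-truth length, which is adequate for a single page. Estimating correction costs, as the abstract describes, would additionally require computing the zone moves, splits, and merges needed to restore the reading order; that step is left out here.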