Document cleanup using page frame detection

Authors:
Faisal Shafait;Joost van Beusekom;Daniel Keysers;Thomas M. Breuel
Affiliations:
German Research Center for Artificial Intelligence (DFKI), Image Understanding and Pattern Recognition (IUPR) Research Group, 67663, Kaiserslautern, Germany;Technical University of Kaiserslautern, Department of Computer Science, 67663, Kaiserslautern, Germany;German Research Center for Artificial Intelligence (DFKI), Image Understanding and Pattern Recognition (IUPR) Research Group, 67663, Kaiserslautern, Germany;Technical University of Kaiserslautern, Department of Computer Science, 67663, Kaiserslautern, Germany
Venue:
International Journal on Document Analysis and Recognition
Year:
2008

Abstract

When a page of a book is scanned or photocopied, textual noise (extraneous symbols from the neighboring page) and/or non-textual noise (black borders, speckles, ...) appear along the border of the document. Existing document analysis methods can handle non-textual noise reasonably well, whereas textual noise still presents a major issue for document analysis systems. Textual noise may result in undesired text in optical character recognition (OCR) output that needs to be removed afterwards. Existing document cleanup methods try to explicitly detect and remove marginal noise. This paper presents a new perspective for document image cleanup by detecting the page frame of the document. The goal of page frame detection is to find the actual page contents area, ignoring marginal noise along the page border. We use a geometric matching algorithm to find the optimal page frame of structured documents (journal articles, books, magazines) by exploiting their text alignment property. We evaluate the algorithm on the UW-III database. The results show that the error rates are below 4% for each of the performance measures used. Further tests were run on a dataset of magazine pages and on a set of camera-captured document images. To demonstrate the benefits of using page frame detection in practical applications, we choose OCR and layout-based document image retrieval as sample applications. Experiments using a commercial OCR system show that by removing characters outside the computed page frame, the OCR error rate is reduced from 4.3% to 1.7% on the UW-III dataset. The use of page frame detection in layout-based document image retrieval decreases the retrieval error rates by 30%.
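
As an illustration of the cleanup step the abstract describes (removing characters outside the computed page frame), the following is a minimal sketch. It is not the paper's geometric matching algorithm: it assumes a page frame has already been found and is given as an axis-aligned rectangle. The names `Box`, `inside_frame`, and `filter_by_frame`, and all coordinates, are hypothetical and chosen only for this example.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box in pixel coordinates, (x0, y0) to (x1, y1)."""
    x0: int
    y0: int
    x1: int
    y1: int


def inside_frame(box: Box, frame: Box) -> bool:
    """Return True if the box lies entirely within the page frame."""
    return (box.x0 >= frame.x0 and box.y0 >= frame.y0
            and box.x1 <= frame.x1 and box.y1 <= frame.y1)


def filter_by_frame(boxes: List[Box], frame: Box) -> List[Box]:
    """Keep only the boxes that fall inside the page frame."""
    return [b for b in boxes if inside_frame(b, frame)]


# Hypothetical data: an assumed page frame and three character boxes.
frame = Box(100, 80, 900, 1200)
chars = [
    Box(120, 100, 140, 130),  # inside the frame: kept
    Box(10, 100, 30, 130),    # in the left margin (e.g. neighboring page): dropped
    Box(890, 500, 910, 530),  # straddles the right frame edge: dropped
]
kept = filter_by_frame(chars, frame)
```

Requiring full containment is one possible design choice; a more tolerant variant could keep boxes whose overlap with the frame exceeds some threshold, at the cost of occasionally retaining marginal noise.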