Table detection in heterogeneous documents

Authors:
Faisal Shafait; Ray Smith
Affiliations:
German Research Center for Artificial Intelligence (DFKI GmbH), Kaiserslautern, Germany; Google Inc., CA
Venue:
DAS '10 Proceedings of the 9th IAPR International Workshop on Document Analysis Systems
Year:
2010

Abstract

Detecting tables in document images is important for two reasons: tables carry important information, and most layout analysis methods fail when a document image contains tables. Existing table detection approaches mainly target tables within single columns of text and do not work reliably on documents with varying layouts. This paper presents a practical table detection algorithm that achieves high accuracy on documents with varying layouts (company reports, newspaper articles, magazine pages, ...). An open source implementation of the algorithm is provided as part of the Tesseract OCR engine. Evaluation on document images from the publicly available UNLV dataset shows competitive performance compared with the table detection module of a commercial OCR system.