Model based table cell detection and content extraction from degraded document images

  • Authors:
  • Zhixin Shi; Srirangaraj Setlur; Venu Govindaraju

  • Affiliations:
  • University at Buffalo, Buffalo, NY (all authors)

  • Venue:
  • Proceedings of the Workshop on Document Analysis and Recognition
  • Year:
  • 2012


Abstract

This paper describes a novel method for detecting tables and extracting the contents of table cells from handwritten document images. Given a model of the table and a document image containing a table, the hand-drawn or pre-printed table is detected and the contents of its cells are extracted automatically. The algorithms are designed to handle degraded binary document images, which may contain a wide variety of noise, ranging from clutter and salt-and-pepper noise to non-text objects such as graphics and logos. The presented algorithm eliminates extraneous noise and identifies potentially matching table layouts by detecting horizontal and vertical table line candidates. A table is represented as a matrix based on the locations of the intersections of horizontal and vertical table lines, and a matching algorithm searches for the table structure that best matches the given layout model, using the matching score to eliminate spurious table line candidates. The optimally matched table candidate is then used for cell content extraction. The method was tested on a set of document page images containing tables from the challenge set of the DARPA MADCAT Arabic handwritten document image data. Preliminary results indicate that the method is effective and reliably extracts text from the table cells.
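The abstract does not give implementation details of the matching step. As a rough illustration only, the sketch below assumes a simplified setting: the layout model is given as a list of normalized gap ratios between adjacent table lines (a hypothetical representation, not necessarily the authors'), detected line candidates are 1-D coordinates, and spurious candidates are pruned by brute-force search for the subset whose gap profile best matches the model.

```python
from itertools import combinations

def match_score(lines, model_gaps):
    """Hypothetical matching score: compare the normalized gaps between
    the chosen lines against the model's gap ratios. A perfect match
    scores 1.0; mismatched spacing lowers the score."""
    gaps = [b - a for a, b in zip(lines, lines[1:])]
    total = sum(gaps)
    normalized = [g / total for g in gaps]
    return 1.0 - sum(abs(n - m) for n, m in zip(normalized, model_gaps)) / 2

def select_lines(candidates, model_gaps):
    """Search all subsets of line candidates of the model's size and keep
    the best-scoring one, discarding spurious line candidates.
    (Brute force for clarity; a real system would prune the search.)"""
    k = len(model_gaps) + 1  # n gaps are bounded by n+1 lines
    best = max(combinations(sorted(candidates), k),
               key=lambda c: match_score(c, model_gaps))
    return list(best)

# A noisy detection of horizontal lines: 55 is a spurious candidate.
# The model expects three equally spaced rows.
chosen = select_lines([0, 50, 55, 100, 150], [1/3, 1/3, 1/3])
print(chosen)  # → [0, 50, 100, 150]
```

The same scoring would be applied independently to vertical line candidates; the surviving horizontal and vertical lines then define the intersection matrix from which cell bounding boxes are read off for content extraction.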