Model based table cell detection and content extraction from degraded document images

  • Authors:
  • Zhixin Shi; Srirangaraj Setlur; Venu Govindaraju

  • Affiliations:
  • University at Buffalo, Buffalo, NY (all authors)

  • Venue:
  • Proceedings of the Workshop on Document Analysis and Recognition
  • Year:
  • 2012


Abstract

This paper describes a novel method for detecting tables and extracting the contents of table cells from handwritten document images. Given a model of the table and a document image containing a table, the hand-drawn or pre-printed table is detected and the contents of its cells are extracted automatically. The algorithms are designed to handle degraded binary document images, which may contain a wide variety of noise, ranging from clutter and salt-and-pepper noise to non-text objects such as graphics and logos. The presented algorithm eliminates extraneous noise and identifies potentially matching table layouts by detecting horizontal and vertical table line candidates. A table is represented as a matrix based on the locations of the intersections of horizontal and vertical table lines, and a matching algorithm searches for the table structure that best matches the given layout model, using the matching score to eliminate spurious table line candidates. The optimally matched table candidate is then used for cell content extraction. The method was tested on a set of document page images containing tables from the challenge set of the DARPA MADCAT Arabic handwritten document image data. Preliminary results indicate that the method is effective and reliably extracts text from the table cells.
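The abstract does not give implementation details of the matching step. As a rough illustration only, the sketch below assumes a simplified setting: the layout model is given as a list of normalized gap ratios between adjacent table lines (a hypothetical representation, not necessarily the authors'), detected line candidates are 1-D coordinates, and spurious candidates are pruned by brute-force search for the subset whose gap profile best matches the model.

```python
from itertools import combinations

def match_score(lines, model_gaps):
    """Hypothetical matching score: compare the normalized gaps between
    the chosen lines against the model's gap ratios. A perfect match
    scores 1.0; mismatched spacing lowers the score."""
    gaps = [b - a for a, b in zip(lines, lines[1:])]
    total = sum(gaps)
    normalized = [g / total for g in gaps]
    return 1.0 - sum(abs(n - m) for n, m in zip(normalized, model_gaps)) / 2

def select_lines(candidates, model_gaps):
    """Search all subsets of line candidates of the model's size and keep
    the best-scoring one, discarding spurious line candidates.
    (Brute force for clarity; a real system would prune the search.)"""
    k = len(model_gaps) + 1  # n gaps are bounded by n+1 lines
    best = max(combinations(sorted(candidates), k),
               key=lambda c: match_score(c, model_gaps))
    return list(best)

# A noisy detection of horizontal lines: 55 is a spurious candidate.
# The model expects three equally spaced rows.
chosen = select_lines([0, 50, 55, 100, 150], [1/3, 1/3, 1/3])
print(chosen)  # → [0, 50, 100, 150]
```

The same scoring would be applied independently to vertical line candidates; the surviving horizontal and vertical lines then define the intersection matrix from which cell bounding boxes are read off for content extraction.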