Table Recognition and Understanding from PDF Files

  • Authors:
  • T. Hassan;R. Baumgartner

  • Affiliations:
  • Vienna University of Technology;Vienna University of Technology

  • Venue:
  • ICDAR '07 Proceedings of the Ninth International Conference on Document Analysis and Recognition - Volume 02
  • Year:
  • 2007

Abstract

We propose a flexible method for detecting and understanding tables in PDF files, which is not reliant upon one particular feature being present, for example ruling lines or indentations, and is therefore applicable to a wide variety of visual presentations. We describe the steps required in transforming the low-level PDF instructions into text segments, lines and boxes on a page. We propose three different classifications for published tables, and develop methods to detect these tables and correctly identify their respective rows and columns. We also explain how to recognize spanning rows and columns, and multi-line rows. Experimental results show that our algorithm is effective in converting a wide variety of tabular presentations into HTML for information extraction purposes.
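
The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea (positioned text segments grouped into rows and columns, then emitted as HTML), not the authors' method. The `Segment` structure, the function names, the `y_tol` tolerance, and the coordinate convention are all assumptions made for illustration. The sketch ignores ruling lines, spanning cells, and multi-line rows, which the paper handles.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Segment:
    """A text fragment with its position (hypothetical structure)."""
    text: str
    x0: float  # left edge
    x1: float  # right edge
    y: float   # baseline; larger y is assumed to be lower on the page


def group_rows(segments, y_tol=2.0):
    """Group segments whose baselines lie within y_tol of the row's first segment."""
    rows = []
    for seg in sorted(segments, key=lambda s: (s.y, s.x0)):
        if rows and abs(rows[-1][0].y - seg.y) <= y_tol:
            rows[-1].append(seg)
        else:
            rows.append([seg])
    return [sorted(row, key=lambda s: s.x0) for row in rows]


def find_columns(segments):
    """Merge overlapping horizontal extents into column intervals."""
    columns = []
    for seg in sorted(segments, key=lambda s: s.x0):
        if columns and seg.x0 <= columns[-1][1]:
            columns[-1][1] = max(columns[-1][1], seg.x1)
        else:
            columns.append([seg.x0, seg.x1])
    return [tuple(c) for c in columns]


def to_grid(segments, y_tol=2.0):
    """Place each segment in the cell given by its row and column interval."""
    columns = find_columns(segments)
    grid = []
    for row in group_rows(segments, y_tol):
        cells = [""] * len(columns)
        for seg in row:
            for i, (left, right) in enumerate(columns):
                if left <= seg.x0 <= right:
                    cells[i] = (cells[i] + " " + seg.text).strip()
                    break
        grid.append(cells)
    return grid


def to_html(grid):
    """Serialize the grid as a plain HTML table, escaping cell text."""
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(c)}</td>" for c in row) + "</tr>"
        for row in grid
    )
    return f"<table>{body}</table>"


if __name__ == "__main__":
    demo = [
        Segment("Name", 10, 40, 100),
        Segment("Qty", 80, 100, 100),
        Segment("Apple", 10, 45, 115),
        Segment("3", 90, 95, 115.5),
    ]
    print(to_html(to_grid(demo)))
```

Because columns are found by merging overlapping horizontal extents, a single cell that spans two columns would fuse them into one. That limitation is one reason a method that does not rely on any single layout feature, as the abstract describes, needs more than this kind of projection.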