Table Recognition and Understanding from PDF Files

Authors:
T. Hassan; R. Baumgartner
Affiliations:
Vienna University of Technology; Vienna University of Technology
Venue:
ICDAR '07 Proceedings of the Ninth International Conference on Document Analysis and Recognition - Volume 02
Year:
2007

Cited by (10):

Object-level document analysis of PDF files
Proceedings of the 9th ACM symposium on Document engineering

GraphWrap: a system for interactive wrapping of pdf documents using graph matching techniques
Proceedings of the 9th ACM symposium on Document engineering

Enabling Interactive Access to Web Tables
Proceedings of the 13th International Conference on Human-Computer Interaction. Part I: New Trends

Converting PDF to HTML approach based on text detection
Proceedings of the 2nd International Conference on Interaction Sciences: Information Technology, Culture and Human

Scalable web data extraction for online market intelligence
Proceedings of the VLDB Endowment

Table of contents recognition for converting PDF documents in e-book formats
Proceedings of the 10th ACM symposium on Document engineering

Towards a common evaluation strategy for table structure recognition algorithms
Proceedings of the 10th ACM symposium on Document engineering

Enhancing browsing experience of table and image elements in web pages
International Conference on Multimodal Interfaces and the Workshop on Machine Learning for Multimodal Interaction

Enabling efficient browsing and manipulation of web tables on smartphone
HCII'11 Proceedings of the 14th international conference on Human-computer interaction: towards mobile and intelligent interaction environments - Volume Part III

The HiLeX system for semantic information extraction
Transactions on Large-Scale Data- and Knowledge-Centered Systems V

Abstract

We propose a flexible method for detecting and understanding tables in PDF files, which is not reliant upon one particular feature being present, for example ruling lines or indentations, and is therefore applicable to a wide variety of visual presentations. We describe the steps required in transforming the low-level PDF instructions into text segments, lines and boxes on a page. We propose three different classifications for published tables, and develop methods to detect these tables and correctly identify their respective rows and columns. We also explain how to recognize spanning rows and columns, and multi-line rows. Experimental results show that our algorithm is effective in converting a wide variety of tabular presentations into HTML for information extraction purposes.