PDF-TREX: An Approach for Recognizing and Extracting Tables from PDF Documents

Authors:
Ermelinda Oro;Massimo Ruffolo
Affiliations:
-;-
Venue:
ICDAR '09 Proceedings of the 2009 10th International Conference on Document Analysis and Recognition
Year:
2009

Citing 0
Cited 4

Towards a common evaluation strategy for table structure recognition algorithms

Proceedings of the 10th ACM symposium on Document engineering
A methodology for evaluating algorithms for table understanding in PDF documents

Proceedings of the 2012 ACM symposium on Document engineering
A practical method for compatibility evaluation of portable document formats

ACIIDS'13 Proceedings of the 5th Asian conference on Intelligent Information and Database Systems - Volume Part II
Towards generic framework for tabular data extraction and management in documents

Proceedings of the sixth workshop on Ph.D. students in information and knowledge management

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper presents PDF-TREX, an heuristic approach for table recognition and extraction from PDF documents.The heuristics starts from an initial set of basic content elements and aligns and groups them, in bottom-up way by considering only their spatial features, in order to identify tabular arrangements of information. The scope of the approach is to recognize tables contained in PDF documents as a 2-dimensional grid on a Cartesian plane and extract them as a set of cells equipped by 2-dimensional coordinates. Experiments, carried out on a dataset composed of tables contained in documents coming from different domains, shows that the approach is well performing in recognizing table cells.The approach aims at improving PDF document annotation and information extraction by providing an output that can be further processed for understanding table and document contents.