Towards generic framework for tabular data extraction and management in documents

Authors:
Roya Rastan
Affiliations:
University of New South Wales, Sydney, Australia
Venue:
Proceedings of the sixth workshop on Ph.D. students in information and knowledge management
Year:
2013

Abstract

Tables are a common structure for presenting data in documents. However, automatically recognising and extracting tables embedded in documents remains a significant challenge, and the data they contain is under-utilised. Although some common steps can be defined for table extraction, there is no generic approach that can be applied to different sources and provide an end-to-end, repeatable workflow. This paper looks at the table extraction problem from a process point of view and proposes a table extraction workflow that can be regarded as a plug-and-play architecture. We also present an overview of our complete system, in which the extracted tables are stored and managed. Table extraction is considered here in the context of financial statements, but the methods apply generally.