An efficient pre-processing method to identify logical components from PDF documents

Authors:
Ying Liu;Kun Bai;Liangcai Gao
Affiliations:
Department of Knowledge Service Engineering, KAIST;IBM Research T.J. Watson Research Center;Institute of Computer Science and Technology, Peking University
Venue:
PAKDD'11 Proceedings of the 15th Pacific-Asia conference on Advances in knowledge discovery and data mining - Volume Part I
Year:
2011

Abstract

With the rapid growth of scientific documents in digital libraries, the demand for searching both whole documents and specific components within them has increased dramatically. Accurately detecting component boundaries is vital to all subsequent information extraction and applications. However, detecting the boundaries of document components (especially tables, figures, and equations) is a challenging problem because there are no standardized formats or layouts across diverse documents. This paper presents an efficient document preprocessing technique that improves component boundary detection by taking advantage of the nature of document lines. Our method reduces the component boundary detection problem to a sparse line analysis problem with far less noise. We define eight document line label types and apply machine learning techniques as well as a heuristic rule-based method to identify multiple document components. Combined with different heuristic rules, the method extracts multiple components in a single batch by filtering out most of the noise as early as possible. Our method focuses on an important untagged document format: PDF. The experimental results demonstrate the effectiveness of the sparse line analysis.
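The idea of filtering lines before component detection can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define what makes a line sparse, nor does it list the eight label types, so the two criteria used here (a large horizontal gap between adjacent words, or a line that covers only a small fraction of the page width) and the names `Word`, `Line`, `is_sparse`, `max_gap`, and `min_fill` are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    x0: float  # left edge of the word box, in page units
    x1: float  # right edge of the word box, in page units


@dataclass
class Line:
    words: List[Word]


def is_sparse(line: Line, page_width: float,
              max_gap: float = 20.0, min_fill: float = 0.5) -> bool:
    """Return True if the line looks sparse under two assumed criteria.

    Hypothetical thresholds, not taken from the paper:
    - max_gap: a gap between adjacent words wider than this marks the line sparse
    - min_fill: a line spanning less than this fraction of the page is sparse
    """
    if not line.words:
        return False
    words = sorted(line.words, key=lambda w: w.x0)
    span = words[-1].x1 - words[0].x0
    largest_gap = max(
        (right.x0 - left.x1 for left, right in zip(words, words[1:])),
        default=0.0,
    )
    return largest_gap > max_gap or span < min_fill * page_width


def sparse_line_indices(lines: List[Line], page_width: float) -> List[int]:
    """Keep only the indices of sparse lines; dense body text is dropped early."""
    return [i for i, line in enumerate(lines) if is_sparse(line, page_width)]


# A dense body-text line, a table-like row with wide gaps, and a short line.
body = Line([Word("The", 0, 30), Word("quick", 35, 80), Word("fox", 85, 480)])
row = Line([Word("A", 0, 20), Word("1.0", 200, 230), Word("2.0", 400, 430)])
short = Line([Word("Table", 0, 40), Word("1", 45, 50)])

candidates = sparse_line_indices([body, row, short], page_width=500.0)
```

In this sketch only the table-like row and the short line survive as candidates, so a later labeling step (learned or rule-based) would see far fewer lines than the full page contains.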