Structure extraction from PDF-based book documents

Authors:
Liangcai Gao;Zhi Tang;Xiaofan Lin;Ying Liu;Ruiheng Qiu;Yongtao Wang
Affiliations:
Peking University, Beijing, China;Peking University, Beijing, China;Vobile Inc., Santa Clara, CA, USA;Korea Advanced Institute of Science and Technology, Daejeon, South Korea;Peking University Founder Group Co.,Ltd., Beijing, China;Peking University, Beijing, China
Venue:
Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries
Year:
2011

Abstract

PDF documents have become a dominant knowledge repository for both academia and industry, largely because they are convenient to print and exchange. However, methods for automatically extracting their structural information have yet to be fully explored, and the lack of effective methods hinders reuse of the information in PDF documents. To enhance the usability of PDF-formatted electronic books, we propose a novel computational framework that analyzes their underlying physical and logical structure. The analysis is conducted at both the page level and the document level and covers global typography, reading order, logical elements, chapter/section hierarchy, and metadata. Moreover, two characteristics of PDF-based books, namely style consistency throughout the book and the natural rendering order of PDF files, are exploited to improve on conventional image-based structure extraction methods. The paper uses the bipartite graph as a common structure for modeling several tasks, including reading order recovery, figure and caption association, and metadata extraction. Based on this graph representation, the optimal matching (OM) method is used to find the global optimum in each task. Extensive benchmarking on real-world data validates the high efficiency and discrimination ability of the proposed method.
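
To illustrate what a globally optimal bipartite matching means for a task such as figure and caption association, the following is a minimal sketch and not the authors' implementation. It assumes each figure and caption is reduced to a hypothetical (x, y) center point, that the edge weight is the Euclidean distance between centers, and that the instance is small enough for exhaustive search; the abstract does not specify the paper's actual weights or its matching algorithm beyond "optimal matching". The function name `associate` and the sample coordinates are made up for illustration.

```python
from itertools import permutations
from math import hypot


def associate(figures, captions):
    """Match each figure to a distinct caption so that the summed
    center-to-center distance is minimal over the whole page.

    figures, captions: lists of (x, y) centers (an assumed representation).
    Requires len(figures) <= len(captions).
    Returns (mapping from figure index to caption index, total cost).
    """
    best_cost = float("inf")
    best_assignment = ()
    # Exhaustive search over all one-to-one assignments. This stands in for
    # a polynomial-time optimal matching algorithm and only suits tiny inputs.
    for perm in permutations(range(len(captions)), len(figures)):
        cost = sum(
            hypot(figures[i][0] - captions[j][0], figures[i][1] - captions[j][1])
            for i, j in enumerate(perm)
        )
        if cost < best_cost:
            best_cost, best_assignment = cost, perm
    return dict(enumerate(best_assignment)), best_cost


# Hypothetical page with two figures stacked vertically, each captioned below.
figures = [(100, 200), (100, 520)]
captions = [(100, 380), (100, 700)]
mapping, cost = associate(figures, captions)
print(mapping, cost)
```

In this sample, the caption at y = 380 is the nearest caption to both figures, so a per-figure nearest-neighbor rule would assign it twice. Minimizing the total cost over the whole bipartite graph resolves the conflict by pairing each figure with the caption directly below it.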