PDF documents have become a dominant knowledge repository in both academia and industry, largely because they are convenient to print and exchange. However, methods for automatically extracting their structure have yet to be fully explored, and the lack of effective methods hinders the reuse of information held in PDF documents. To improve the usability of PDF-formatted electronic books, we propose a novel computational framework that analyzes both their underlying physical structure and their logical structure. The analysis is conducted at the page level and at the document level, and covers global typography, reading order, logical elements, chapter/section hierarchy, and metadata. Moreover, two characteristics of PDF-based books, namely style consistency across the whole book and the natural rendering order of PDF files, are fully exploited to improve on conventional image-based structure extraction methods. The paper uses the bipartite graph as a common structure for modeling several tasks, including reading order recovery, figure and caption association, and metadata extraction. On this graph representation, the optimal matching (OM) method is used to find the global optimum for each task. Extensive benchmarking on real-world data validates the efficiency and discrimination ability of the proposed method.
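As an illustration of the bipartite formulation, figure and caption association can be sketched as a minimum-cost matching between two sets of page elements. The sketch below is not the paper's implementation: the element names, the coordinates, and the choice of centre-to-centre distance as the edge weight are all assumptions made for the example, and the exhaustive search stands in for a polynomial-time optimal matching algorithm such as Kuhn-Munkres, which would be needed for pages with more than a handful of elements.

```python
from itertools import permutations
from math import hypot


def optimal_matching(cost):
    """Return (total_cost, assignment) minimizing sum(cost[i][assignment[i]]).

    Exhaustive search over all one-to-one assignments of rows to columns.
    Only suitable for tiny instances; requires len(cost) <= len(cost[0]).
    """
    n_rows = len(cost)
    n_cols = len(cost[0])
    best_total = float("inf")
    best_assign = None
    for cols in permutations(range(n_cols), n_rows):
        total = sum(cost[i][j] for i, j in enumerate(cols))
        if total < best_total:
            best_total = total
            best_assign = cols
    return best_total, list(best_assign)


# Hypothetical page data: centre points (x, y) of figures and captions.
figures = {"fig_a": (100, 200), "fig_b": (100, 500), "fig_c": (400, 200)}
captions = {"cap_1": (100, 560), "cap_2": (400, 260), "cap_3": (100, 260)}

fig_names = list(figures)
cap_names = list(captions)

# Edge weight (an assumption): Euclidean distance between centre points.
cost = [
    [
        hypot(figures[f][0] - captions[c][0], figures[f][1] - captions[c][1])
        for c in cap_names
    ]
    for f in fig_names
]

total, assign = optimal_matching(cost)
pairs = {fig_names[i]: cap_names[j] for i, j in enumerate(assign)}
print(pairs)
```

Because the matching minimizes the total cost over all pairs at once, a caption that lies close to two figures is assigned with the rest of the page taken into account, rather than being claimed by whichever figure is processed first. The same structure could in principle be reused for the other tasks mentioned above by changing what the two vertex sets and the edge weights represent.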