Structure extraction from PDF-based book documents

  • Authors: Liangcai Gao; Zhi Tang; Xiaofan Lin; Ying Liu; Ruiheng Qiu; Yongtao Wang

  • Affiliations: Peking University, Beijing, China; Peking University, Beijing, China; Vobile Inc., Santa Clara, CA, USA; Korea Advanced Institute of Science and Technology, Daejeon, South Korea; Peking University Founder Group Co., Ltd., Beijing, China; Peking University, Beijing, China

  • Venue: Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries

  • Year: 2011

Abstract

PDF documents have become a dominant knowledge repository for both academia and industry, largely because they are convenient to print and exchange. However, methods for automatically extracting structure information from them have yet to be fully explored, and the lack of effective methods hinders the reuse of information in PDF documents. To improve the usability of PDF-formatted electronic books, we propose a novel computational framework that analyzes both their underlying physical structure and their logical structure. The analysis is conducted at both the page level and the document level, and covers global typographies, reading order, logical elements, chapter/section hierarchy, and metadata. Moreover, two characteristics of PDF-based books, namely style consistency across the whole book and the natural rendering order of PDF files, are fully exploited to improve on conventional image-based structure extraction methods. The paper uses the bipartite graph as a common structure for modeling several tasks, including reading order recovery, figure and caption association, and metadata extraction. Based on this graph representation, the optimal matching (OM) method is used to find the global optima in those tasks. Extensive benchmarking on real-world data validates the high efficiency and discrimination ability of the proposed method.
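
To illustrate the bipartite-graph formulation mentioned in the abstract, the following is a minimal sketch of figure and caption association as a minimum-cost matching. It is not the authors' implementation: the cost function (Euclidean distance between block centres), the coordinates, and the function name `optimal_matching` are all assumptions made for this example, and an exhaustive search stands in for a proper optimal matching algorithm such as Kuhn-Munkres, which would be needed for anything beyond a handful of nodes.

```python
from itertools import permutations
from math import hypot


def optimal_matching(cost):
    """Return (total_cost, assignment) with the smallest summed edge cost.

    cost[i][j] is the cost of pairing left node i with right node j.
    assignment[i] is the right node matched to left node i.
    Exhaustive search: only suitable for very small graphs, and it
    assumes there are at least as many right nodes as left nodes.
    """
    n_left = len(cost)
    n_right = len(cost[0])
    best_total, best_assign = float("inf"), None
    for perm in permutations(range(n_right), n_left):
        total = sum(cost[i][j] for i, j in enumerate(perm))
        if total < best_total:
            best_total, best_assign = total, perm
    return best_total, list(best_assign)


# Hypothetical block centres (x, y) on one page, in points.
figures = [(100, 200), (100, 600), (400, 200)]
captions = [(400, 260), (100, 260), (100, 660)]

# Edge weight (assumed for illustration): distance between centres.
cost = [[hypot(fx - cx, fy - cy) for cx, cy in captions]
        for fx, fy in figures]

total, assignment = optimal_matching(cost)
for fig_idx, cap_idx in enumerate(assignment):
    print(f"figure {fig_idx} -> caption {cap_idx}")
```

The point of solving the matching globally rather than pairing each node with its nearest neighbour is visible on a small cost matrix such as `[[1, 2], [1, 100]]`: a row-by-row greedy choice pairs left node 0 with right node 0 and is then forced into the edge of cost 100, whereas `optimal_matching` returns the assignment `[1, 0]` with total cost 3.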