Metadata Extraction from PDF Papers for Digital Library Ingest

  • Authors: Simone Marinai

  • Venue: ICDAR '09: Proceedings of the 2009 10th International Conference on Document Analysis and Recognition
  • Year: 2009

Abstract

This paper reviews our recent research on applying document analysis techniques to metadata extraction from PDF papers. We describe a package designed to extract basic metadata from these documents. The package is used together with a digital library software suite to make it easy to build personal digital libraries. The software combines several techniques, including PDF parsing, low-level document image processing, and layout analysis. In addition, we use information gathered from a widely known citation database (DBLP) to assist the tool in the difficult task of author identification. The system is tested on paper collections selected from recent conference proceedings.
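
The idea of using a citation database to help with author identification can be illustrated with a small sketch. This is not the paper's implementation. It assumes that text lines from the first page have already been obtained by PDF parsing and layout analysis, and that a set of author names has been loaded from a DBLP export. The function names (`normalize`, `find_authors`) and the sample data are hypothetical.

```python
import re
import unicodedata


def normalize(name: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = re.sub(r"[^a-z\s]", " ", no_accents.lower())
    return " ".join(cleaned.split())


def find_authors(lines, known_authors):
    """Return names found in `lines` that match an entry in `known_authors`.

    `known_authors` stands in for names taken from a citation database;
    here it is just an in-memory collection (an assumption of this sketch).
    """
    index = {normalize(a): a for a in known_authors}
    found = []
    for line in lines:
        # Author lines often separate names with commas, semicolons, or "and".
        for chunk in re.split(r",|;|\band\b", line):
            key = normalize(chunk)
            if key in index and index[key] not in found:
                found.append(index[key])
    return found


# Hypothetical input: lines from a first page and a tiny list of known names.
first_page_lines = [
    "Metadata Extraction from PDF Papers for Digital Library Ingest",
    "Simone Marinai",
    "Abstract",
]
known = {"Simone Marinai", "Jane Q. Example"}
authors = find_authors(first_page_lines, known)
print(authors)
```

Exact matching after normalization is the simplest possible choice. A practical tool would probably also need to handle abbreviated first names, name-order variations, and lines that mix names with affiliations, none of which this sketch attempts.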