Metadata Extraction from PDF Papers for Digital Library Ingest

Authors:
Simone Marinai
Affiliations:
-
Venue:
ICDAR '09 Proceedings of the 2009 10th International Conference on Document Analysis and Recognition
Year:
2009

Cited by:
Table of contents recognition for converting PDF documents in e-book formats. Proceedings of the 10th ACM symposium on Document engineering.
Towards a faithful visualization of historical books on e-book readers. Proceedings of the 2011 Workshop on Historical Document Imaging and Processing.
Web-based citation parsing, correction and augmentation. Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries.

Abstract

In this paper we analyze our recent research on the use of document analysis techniques for metadata extraction from PDF papers. We describe a package designed to extract basic metadata from these documents. The package is used together with a digital library software suite to make it easy to build personal digital libraries. The software combines several techniques, including PDF parsing, low-level document image processing, and layout analysis. In addition, we use information gathered from a widely known citation database (DBLP) to assist the tool in the difficult task of author identification. The system is tested on paper collections selected from recent conference proceedings.
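The author identification step mentioned in the abstract can be illustrated with a minimal sketch. It is not the package's actual implementation: it assumes the header text has already been extracted from the PDF, and that a list of known author names (here a tiny hard-coded sample standing in for names gathered from DBLP) is available. The function names, the sample names other than the paper's author, and the 0.85 similarity cutoff are all hypothetical choices made for illustration.

```python
import difflib
import re

# Hypothetical stand-in for a list of author names gathered from DBLP.
KNOWN_AUTHORS = ["Simone Marinai", "Ada Lovelace", "Alan Turing"]


def normalize(name: str) -> str:
    """Lowercase, drop digits and punctuation (e.g. footnote marks), collapse spaces."""
    return " ".join(re.sub(r"[\W\d_]+", " ", name.lower()).split())


def split_candidates(line: str) -> list[str]:
    """Split a header line into candidate names on commas and the word 'and'."""
    parts = re.split(r",|\band\b", line)
    return [p.strip() for p in parts if p.strip()]


def match_authors(line: str, known: list[str], cutoff: float = 0.85) -> list[str]:
    """Return the known names that closely match a candidate found in the line."""
    index = {normalize(n): n for n in known}
    found = []
    for candidate in split_candidates(line):
        hits = difflib.get_close_matches(
            normalize(candidate), list(index), n=1, cutoff=cutoff
        )
        if hits:
            found.append(index[hits[0]])
    return found


if __name__ == "__main__":
    # An author line with affiliation superscripts, as often seen in extracted text.
    print(match_authors("Simone Marinai1, Ada Lovelace2 and Alan Turing", KNOWN_AUTHORS))
```

Fuzzy matching against a name list tolerates small extraction errors, such as a misread character or a trailing footnote digit, while lines that contain no known name (affiliations, titles) yield no match.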