A metadata generation system for scanned scientific volumes

Authors:
Xiaonan Lu; Brewster Kahle; James Z. Wang; C. Lee Giles
Affiliations:
The Pennsylvania State University, University Park, PA, USA; Internet Archive, San Francisco, CA, USA; The Pennsylvania State University, University Park, PA, USA; The Pennsylvania State University, University Park, PA, USA
Venue:
Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries
Year:
2008

Citing 9

CiteSeer: an automatic citation indexing system
    Proceedings of the third ACM conference on Digital libraries
Making large-scale support vector machine learning practical
    Advances in kernel methods
Knowledge-based metadata extraction from PostScript files
    DL '00 Proceedings of the fifth ACM conference on Digital libraries
Localizing experience of digital content via structural metadata
    Proceedings of the 2nd ACM/IEEE-CS joint conference on Digital libraries
Automatic document metadata extraction using support vector machines
    Proceedings of the 3rd ACM/IEEE-CS joint conference on Digital libraries
A Dynamic Feature Generation System for Automated Metadata Extraction in Preservation of Digital Materials
    DIAL '04 Proceedings of the First International Workshop on Document Image Analysis for Libraries (DIAL'04)
A service-oriented architecture for digital libraries
    Proceedings of the 2nd international conference on Service oriented computing
Automatic extraction of titles from general documents using machine learning
    Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries
The class imbalance problem: A systematic study
    Intelligent Data Analysis

Cited 2

Automatic metadata generation for scanned scientific volumes
    Proceedings of the 2008 ACM workshop on Research advances in large digital book repositories
Content integration in digital libraries
    AMC '09 Proceedings of the 2009 workshop on Ambient media computing

Abstract

Digital libraries have undertaken large-scale digitization projects to preserve cultural artifacts and to provide permanent access to them. The growing volume of digitized resources, including scanned books and scientific publications, requires tools and methods that can efficiently analyze and manage large collections. In this work, we tackle the problem of extracting metadata from scanned journal volumes. Our goal is to extract information describing the internal structure and content of scanned volumes, which is necessary for giving digital library users effective access to that content. We propose methods for automatically generating volume-level, issue-level, and article-level metadata based on format and text features extracted from OCRed text. We report the performance of our system on scanned bound historical documents nearly two centuries old. We have integrated the system into an operational digital library, the Internet Archive, for real-world use.