A Dynamic Feature Generation System for Automated Metadata Extraction in Preservation of Digital Materials

Authors:
Song Mao; Jong Woo Kim; George R. Thoma
Venue:
DIAL '04 Proceedings of the First International Workshop on Document Image Analysis for Libraries (DIAL'04)
Year:
2004

Abstract

Obsolescence of storage media, and of the hardware and software needed for access and use, can render old electronic files inaccessible and unusable. The long-term preservation of digital materials has therefore become an active area of research. At the U.S. National Library of Medicine (NLM), we are investigating the preservation of scanned and online medical journal articles, though other data types (e.g., video sequences) are also of interest. Several types of metadata have been proposed to record the information needed to preserve digital materials. Given the ever-increasing volume of medical journals and the high labor cost of manual data entry, automated metadata extraction is crucial. A system has been developed at NLM to automatically generate descriptive metadata, including title, author, affiliation, and abstract, from scanned medical journals. A module called ZoneMatch generates geometric and contextual features from a set of issues of each journal. A rule-based labeling module called ZoneCzar then uses these features to perform labeling independent of journal layout styles. However, if layout style varies significantly among the issues of the same journal, the features generated from one set of issues may not be useful for labeling a different set. In this paper, we describe a dynamic feature updating system in which the features used for labeling the current journal issue are generated from previous issues with a similar layout style. This new system can adapt to style variations among different issues of the same journal. Experimental results show that the new system delivers improved labeling accuracy.
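
The idea of generating features only from previous issues with a similar layout style can be illustrated with a minimal sketch. It is not the ZoneMatch or ZoneCzar implementation, whose details the abstract does not give. The `Issue` record, the numeric `style` descriptor, the Euclidean distance test, the `max_distance` threshold, and the single `title_zone_top` feature are all assumptions made for illustration.

```python
from dataclasses import dataclass
from math import dist
from statistics import mean
from typing import Dict, List, Tuple


@dataclass
class Issue:
    """Hypothetical record for one already-processed journal issue."""
    issue_id: str
    style: Tuple[float, ...]  # assumed layout-style descriptor (illustrative only)
    title_zone_top: float     # assumed geometric feature: normalized top edge of the title zone


def select_similar_issues(current_style: Tuple[float, ...],
                          previous: List[Issue],
                          max_distance: float) -> List[Issue]:
    """Keep previous issues whose style descriptor is within max_distance of the current one."""
    return [p for p in previous if dist(current_style, p.style) <= max_distance]


def generate_features(current_style: Tuple[float, ...],
                      previous: List[Issue],
                      max_distance: float = 0.05) -> Dict[str, float]:
    """Average a geometric feature over style-similar previous issues.

    Falls back to all previous issues when none is similar enough
    (an assumed policy, not one stated in the abstract).
    """
    if not previous:
        raise ValueError("at least one previous issue is required")
    pool = select_similar_issues(current_style, previous, max_distance) or previous
    return {
        "title_zone_top": mean(p.title_zone_top for p in pool),
        "n_source_issues": len(pool),
    }


# Made-up data: two issues share a layout style, one differs.
previous_issues = [
    Issue("v1n1", (0.10, 0.50), 0.12),
    Issue("v1n2", (0.11, 0.52), 0.14),
    Issue("v2n1", (0.40, 0.90), 0.30),
]
features = generate_features((0.10, 0.51), previous_issues, max_distance=0.05)
```

With this made-up data, only the two issues that share a layout style contribute to the averaged feature, so an issue published after a redesign does not distort the features used to label the current issue.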