Automated template-based metadata extraction architecture

Authors:
Paul Flynn;Li Zhou;Kurt Maly;Steven Zeil;Mohammad Zubair
Affiliations:
Department of Computer Science, Old Dominion University, Norfolk, VA;Department of Computer Science, Old Dominion University, Norfolk, VA;Department of Computer Science, Old Dominion University, Norfolk, VA;Department of Computer Science, Old Dominion University, Norfolk, VA;Department of Computer Science, Old Dominion University, Norfolk, VA
Venue:
ICADL'07 Proceedings of the 10th international conference on Asian digital libraries: looking back 10 years and forging new frontiers
Year:
2007

Abstract

This paper describes our efforts to develop a toolset and process for automated metadata extraction from large, diverse, and evolving document collections. A number of federal agencies, universities, laboratories, and companies are placing their collections online and making them searchable via metadata fields such as author, title, and publishing organization. Manually creating metadata for a large collection is extremely time-consuming, yet the task is difficult to automate, particularly for collections whose documents vary widely in layout and structure. Our automated process makes many more documents available online than time and cost constraints would otherwise allow. We describe our architecture and implementation and illustrate the effectiveness of the toolset with experimental results on two major collections: DTIC (Defense Technical Information Center) and NASA (National Aeronautics and Space Administration).