Automated document metadata extraction

Authors:
Bolanle Adefowoke Ojokoh;Olumide Sunday Adewale;Samuel Oluwole Falaki
Affiliations:
Department of Computer Science, Federal University of Technology, Nigeria;Department of Computer Science, Federal University of Technology, Nigeria;Department of Computer Science, Federal University of Technology, Nigeria
Venue:
Journal of Information Science
Year:
2009



Abstract

Web documents are available in various forms, most of which do not carry additional semantics. This paper presents a model for general document metadata extraction. The model, which combines keyword-based segmentation with pattern matching, was implemented using PHP, MySQL, JavaScript and HTML. The system was tested on 40 randomly selected PDF documents (mainly theses) and evaluated using four standard measures: precision, recall, accuracy and F-measure. The results show that the model is relatively effective for metadata extraction, especially for theses and dissertations. Future work will explore combining machine learning with these rule-based methods to improve results.
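
The four evaluation measures named in the abstract have standard definitions in terms of true/false positives and negatives. The sketch below is illustrative only: it is written in Python rather than the PHP used by the authors, and the `Counts` class, the function names and the sample counts are assumptions made for this example, not values or code from the paper.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Hypothetical per-field tallies for one metadata field (e.g. title)."""
    tp: int  # extracted and correct
    fp: int  # extracted but incorrect
    fn: int  # present in the document but not extracted
    tn: int  # absent from the document and correctly not extracted


def precision(c: Counts) -> float:
    # Share of extracted values that are correct.
    denom = c.tp + c.fp
    return c.tp / denom if denom else 0.0


def recall(c: Counts) -> float:
    # Share of true values that were extracted.
    denom = c.tp + c.fn
    return c.tp / denom if denom else 0.0


def accuracy(c: Counts) -> float:
    # Share of all decisions (extract or skip) that were correct.
    total = c.tp + c.fp + c.fn + c.tn
    return (c.tp + c.tn) / total if total else 0.0


def f_measure(c: Counts) -> float:
    # Harmonic mean of precision and recall (balanced F1).
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


if __name__ == "__main__":
    # Made-up counts for illustration; not results from the paper.
    sample = Counts(tp=6, fp=2, fn=4, tn=8)
    print(precision(sample), recall(sample), accuracy(sample), f_measure(sample))
```

Each function returns 0.0 when its denominator is zero, which is one common convention; an evaluation could equally choose to report such cases as undefined.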