Automatically generating high quality metadata by analyzing the document code of common file types

Authors:
Lars Fredrik Høimyr Edvardsen;Ingeborg Torvik Sølvberg;Trond Aalberg;Hallvard Trætteberg
Affiliations:
Intelligent Communication AS/The Norwegian University of Science and Technology, Oslo, Norway;The Norwegian University of Science and Technology, Trondheim, Norway;The Norwegian University of Science and Technology, Trondheim, Norway;The Norwegian University of Science and Technology, Trondheim, Norway
Venue:
Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries
Year:
2009

Abstract

A major challenge for content management in intranets and other large-scale document storage and retrieval services is the generation of high-quality metadata. Manual generation of metadata is resource demanding and is often viewed by collection managers and document authors as an inefficient use of their time, so there is a desire for other ways to create the needed metadata. Automatic Metadata Generation (AMG) comprises methods for generating metadata without manual interaction, using computer programs to interpret the document and possibly the document context. Current AMG research has been limited to collections of similarly formatted documents. The research presented in this paper expands the field of AMG by presenting an approach that is independent of a common visualization scheme: AMG based on document code analysis. This is done by showing AMG possibilities for LaTeX, Word and PowerPoint documents and how this approach can significantly increase the quality of the generated metadata. It does so by giving AMG algorithms direct access to the user-specified intellectual content and the file formatting, thereby avoiding common quality-reducing factors such as incompleteness, low accuracy, and poor logical consistency, coherence and timeliness. This research also shows how this AMG approach can be combined with other AMG approaches, drawing on their strengths in order to achieve the desired high-quality metadata entities.
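
As a rough illustration of what document code analysis could look like for LaTeX, the sketch below reads metadata from the source markup instead of from the rendered layout. It is not the authors' implementation: the function name `extract_latex_metadata`, the field mapping `LATEX_FIELDS`, and the Dublin Core style element names (`title`, `creator`, `date`) are assumptions made for this example, and the pattern only handles command arguments without nested braces.

```python
import re

# Hypothetical mapping from LaTeX commands to metadata element names
# (assumed for illustration; the abstract does not specify a schema).
LATEX_FIELDS = {"title": "title", "author": "creator", "date": "date"}


def extract_latex_metadata(source: str) -> dict:
    """Read metadata from LaTeX document code rather than from its visual layout."""
    metadata = {}
    for command, element in LATEX_FIELDS.items():
        # Match \command{...}; arguments containing nested braces are not handled.
        match = re.search(r"\\" + command + r"\{([^{}]*)\}", source)
        if not match:
            continue
        value = match.group(1).strip()
        if command == "author":
            # LaTeX separates multiple authors with \and.
            metadata[element] = [name.strip() for name in value.split(r"\and")]
        else:
            metadata[element] = value
    return metadata


sample = (
    "\\documentclass{article}\n"
    "\\title{A Sample Report}\n"
    "\\author{Ada Lovelace \\and Alan Turing}\n"
    "\\begin{document}\\maketitle\\end{document}\n"
)
result = extract_latex_metadata(sample)
print(result)
```

Because the author states the title and author explicitly in the markup, this kind of lookup does not depend on font sizes or page position, which is the independence from a common visualization scheme that the abstract describes. Word and PowerPoint files would need a different reader for their own document code.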