SA_MetaMatch: relevant document discovery through document metadata and indexing

Authors:
Hiu S. Yau; J. Scott Hawker
Affiliations:
University of Alabama, Tuscaloosa, AL; University of Alabama, Tuscaloosa, AL
Venue:
ACM-SE 42 Proceedings of the 42nd annual Southeast regional conference
Year:
2004

Abstract

SA_MetaMatch, a component of the Standards Advisor (SA), finds relevant documents by matching indices of metadata and document content. Most elements of the metadata schema are adopted from the Dublin Core (DC), and the XML metadata schema and its encoding follow the DC recommended guidelines. Metadata is generated manually for unstructured documents or extracted automatically from documents with a well-defined layout, and is then stored in metadata files or in a repository. Indices are generated for the descriptive metadata elements and for the document content. These indices are searched and compared to find related documents, based on our observation that if the metadata and the high-frequency index words of document content are related, then the corresponding documents are likely to be related as well. The result is a ranked list of possibly relevant documents. We explored several matching algorithms and selected a sum-of-word-scores approach, which gives a relevance score for each matched document as well as an individual score for each matching word; the per-word scores give domain experts hints for grasping the concepts in the documents.
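
The abstract does not give the scoring formula, so the following is only a minimal sketch of what a sum-of-word-scores match over combined metadata and content indices could look like. Everything in it is an assumption made for illustration: the function names (`build_index`, `match`, `rank`), the stopword list, the metadata weight of 2, the `top_k` cutoff, and the choice of `min` as the per-word score are hypothetical and are not taken from SA_MetaMatch.

```python
import re
from collections import Counter

# Assumed for illustration: a tiny stopword list and a fixed metadata weight.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}
METADATA_WEIGHT = 2


def tokenize(text):
    """Lowercase the text and return its non-stopword alphanumeric tokens."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]


def build_index(metadata, content, top_k=10):
    """Build a word -> weight index for one document.

    Words from descriptive metadata values (e.g. Dublin Core title, subject)
    add METADATA_WEIGHT per occurrence; the top_k most frequent content words
    add their raw frequency.
    """
    index = Counter()
    for value in metadata.values():
        for word in tokenize(value):
            index[word] += METADATA_WEIGHT
    for word, count in Counter(tokenize(content)).most_common(top_k):
        index[word] += count
    return index


def match(query_index, candidate_index):
    """Return (document score, per-word scores) for two indices.

    Each shared word scores the smaller of its two weights (an assumed rule);
    the document score is the sum of those word scores.
    """
    shared = query_index.keys() & candidate_index.keys()
    word_scores = {w: min(query_index[w], candidate_index[w]) for w in shared}
    return sum(word_scores.values()), word_scores


def rank(query_index, corpus):
    """Rank documents in corpus (doc_id -> index) by descending score."""
    results = []
    for doc_id, index in corpus.items():
        total, word_scores = match(query_index, index)
        if total > 0:
            results.append((doc_id, total, word_scores))
    results.sort(key=lambda r: (-r[1], r[0]))
    return results


# Hypothetical documents, invented for this example.
query = build_index(
    {"title": "Welding standards for aluminum", "subject": "welding"},
    "welding aluminum joints requires welding procedure aluminum",
)
corpus = {
    "B": build_index(
        {"title": "Aluminum welding inspection", "subject": "inspection"},
        "inspection of welding on aluminum parts inspection",
    ),
    "C": build_index(
        {"title": "Software configuration management", "subject": "software"},
        "software configuration baseline software",
    ),
}
ranked = rank(query, corpus)
for doc_id, total, word_scores in ranked:
    print(doc_id, total, word_scores)
```

Returning the per-word scores alongside the document total mirrors the point made in the abstract: the individual word scores show which terms drove a match, which a domain expert can read as a hint about the concepts the two documents share.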