Genre classification in automated ingest and appraisal metadata

Authors:
Yunhyong Kim;Seamus Ross
Affiliations:
Digital Curation Centre (DCC) & Humanities Advanced Technology and Information Institute (HATII), University of Glasgow, Glasgow, UK;Digital Curation Centre (DCC) & Humanities Advanced Technology and Information Institute (HATII), University of Glasgow, Glasgow, UK
Venue:
ECDL'06 Proceedings of the 10th European conference on Research and Advanced Technology for Digital Libraries
Year:
2006

Abstract

Metadata creation is a crucial aspect of the ingest of digital materials into digital libraries. The metadata needed to document and manage digital materials are extensive, and creating them manually is expensive. The Digital Curation Centre (DCC) has undertaken research to automate this process for some classes of digital material. We have segmented the problem, and this paper discusses results in genre classification as a first step toward automating metadata extraction from documents. Here we propose a classification method that views documents from five directions: as an object exhibiting a specific visual format, as a linear layout of strings with characteristic grammar, as an object with stylometric signatures, as an object with intended meaning and purpose, and as an object linked to previously classified objects and other external sources. The results of some experiments in relation to the first two directions are described here; they are meant to be indicative of the promise underlying this multi-faceted approach.