Knowledge-based metadata extraction from PostScript files
DL '00 Proceedings of the fifth ACM conference on Digital libraries
Machine learning in automated text categorization
ACM Computing Surveys (CSUR)
Automatic document metadata extraction using support vector machines
Proceedings of the 3rd ACM/IEEE-CS joint conference on Digital libraries
Fine-Grained Document Genre Classification Using First Order Random Graphs
ICDAR '01 Proceedings of the Sixth International Conference on Document Analysis and Recognition
ICDAR '03 Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 1
Automatic detection of text genre
ACL '98 Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics
Recognizing text genres with simple metrics using discriminant analysis
COLING '94 Proceedings of the 15th conference on Computational linguistics - Volume 2
Clustering document images using a bag of symbols representation
ICDAR '05 Proceedings of the Eighth International Conference on Document Analysis and Recognition
Learning subjective nouns using extraction pattern bootstrapping
CONLL '03 Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4
Performance comparison of six algorithms for page segmentation
DAS'06 Proceedings of the 7th international conference on Document Analysis Systems
PERC: a personal email classifier
ECIR'06 Proceedings of the 28th European conference on Advances in Information Retrieval
Searching for ground truth: a stepping stone in automating genre classification
DELOS'07 Proceedings of the 1st international conference on Digital libraries: research and development
Hi-index | 0.00 |
Metadata creation is a crucial aspect of the ingest of digital materials into digital libraries. Metadata needed to document and manage digital materials are extensive and manual creation of them expensive. The Digital Curation Centre (DCC) has undertaken research to automate this process for some classes of digital material. We have segmented the problem and this paper discusses results in genre classification as a first step toward automating metadata extraction from documents. Here we propose a classification method built on looking at the documents from five directions; as an object exhibiting a specific visual format, as a linear layout of strings with characteristic grammar, as an object with stylo-metric signatures, as an object with intended meaning and purpose, and as an object linked to previously classified objects and other external sources. The results of some experiments in relation to the first two directions are described here; they are meant to be indicative of the promise underlying this multi-facetted approach.