Searching for ground truth: a stepping stone in automating genre classification

Authors:
Yunhyong Kim;Seamus Ross
Affiliations:
Digital Curation Centre & Humanities Advanced Technology Information Institute, University of Glasgow, Glasgow, UK;Digital Curation Centre & Humanities Advanced Technology Information Institute, University of Glasgow, Glasgow, UK
Venue:
DELOS'07 Proceedings of the 1st international conference on Digital libraries: research and development
Year:
2007

Abstract

This paper examines genre classification of documents and its role in enabling the effective automated management of digital documents by digital libraries and other repositories. We have previously presented genre classification as a valuable step toward the automated extraction of descriptive metadata for digital material. Here, we present results from experiments with human labellers. The experiments were conducted to assist in genre characterisation, to predict the obstacles that an automated system will need to overcome, and to contribute to a solid testbed corpus for extending automated genre classification and testing metadata extraction tools across genres. We also describe the performance of two classifiers, based on image and stylistic modelling features, in labelling the data that resulted from the agreement of three human labellers across fifteen genre classes.