Assessing the practical usability of an automatically annotated corpus

Authors:
Faisal Mahbub Chowdhury;Alberto Lavelli
Affiliations:
University of Trento, Italy and Human Language Technology Research Unit, Fondazione Bruno Kessler, Trento, Italy;Human Language Technology Research Unit, Fondazione Bruno Kessler, Trento, Italy
Venue:
LAW V '11 Proceedings of the 5th Linguistic Annotation Workshop
Year:
2011

Abstract

The creation of a gold standard corpus (GSC) is a laborious and costly process. Silver standard corpus (SSC) annotation is a recent direction in corpus development that relies on multiple systems instead of human annotators. In this paper, we investigate the practical usability of an SSC when a machine learning system is trained on it and tested on an unseen benchmark GSC. The main focus is how an SSC can be maximally exploited. In the process, we examine several hypotheses that might have influenced the idea of SSC creation. Empirical results suggest that some of these hypotheses (e.g. that a large SSC has a positive impact despite wrong and missing annotations) are not fully correct. We show that it is possible to automatically improve both the quality and the quantity of the SSC annotations. We also observe that using only those SSC sentences that contain annotations, rather than the full SSC, yields a performance boost.
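The last observation, training only on SSC sentences that carry annotations, can be illustrated with a minimal sketch. The abstract does not describe the corpus format, so the `Sentence` class, the `keep_annotated` helper, the span layout, and the `ENTITY` label below are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Sentence:
    """One corpus sentence with its automatic annotations (hypothetical layout)."""
    text: str
    # Each annotation is a (start, end, label) character span.
    annotations: List[Tuple[int, int, str]] = field(default_factory=list)


def keep_annotated(sentences: List[Sentence]) -> List[Sentence]:
    """Return only the sentences that carry at least one annotation."""
    return [s for s in sentences if s.annotations]


# Toy stand-in for an SSC: one annotated sentence, one without annotations.
ssc = [
    Sentence("Aspirin reduces fever.", [(0, 7, "ENTITY")]),
    Sentence("The study ran for two years."),
]

# Training data restricted to annotated sentences.
filtered = keep_annotated(ssc)
```

Here `filtered` would be passed to the learner in place of the full `ssc`; whether this helps on a given corpus is an empirical question of the kind the paper investigates.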