The creation of a gold standard corpus (GSC) is a laborious and costly process. Silver standard corpus (SSC) annotation is a recent direction in corpus development that relies on multiple automatic systems instead of human annotators. In this paper, we investigate the practical usability of an SSC by training a machine learning system on it and testing it on an unseen benchmark GSC. The main focus of this paper is how an SSC can be maximally exploited. In the process, we examine several hypotheses underlying the idea of SSC creation. Empirical results suggest that some of these hypotheses (e.g. the positive impact of a large SSC despite its wrong and missing annotations) do not fully hold. We show that it is possible to automatically improve both the quality and the quantity of the SSC annotations. We also observe that training only on those SSC sentences that contain annotations, rather than on the full SSC, yields a performance boost.
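The sentence-filtering idea described in the abstract can be sketched as follows. This is a minimal illustration under assumed conventions, not the authors' implementation: it assumes the SSC is stored in a CoNLL-style format where each sentence is a list of (token, label) pairs and "O" marks tokens outside any annotation; the function name `filter_annotated` is hypothetical.

```python
def filter_annotated(ssc_sentences):
    """Keep only SSC sentences that contain at least one annotation,
    i.e. at least one token whose label is not 'O' (outside)."""
    return [sent for sent in ssc_sentences
            if any(label != "O" for _, label in sent)]

# Assumed toy SSC: the first sentence carries a (hypothetical) disease
# annotation and is kept; the second has no annotations and is dropped.
ssc = [
    [("Aspirin", "O"), ("treats", "O"), ("migraine", "B-Disease")],
    [("The", "O"), ("end", "O")],
]
train_set = filter_annotated(ssc)
```

A learner would then be trained on `train_set` instead of the full SSC, discarding sentences whose lack of annotations may reflect missed (rather than genuinely absent) entities.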