Assessing the practical usability of an automatically annotated corpus

  • Authors: Faisal Mahbub Chowdhury; Alberto Lavelli
  • Affiliations: University of Trento, Italy and Human Language Technology Research Unit, Fondazione Bruno Kessler, Trento, Italy; Human Language Technology Research Unit, Fondazione Bruno Kessler, Trento, Italy
  • Venue: LAW V '11 Proceedings of the 5th Linguistic Annotation Workshop
  • Year: 2011

Abstract

The creation of a gold standard corpus (GSC) is a very laborious and costly process. Silver standard corpus (SSC) annotation is a very recent direction of corpus development which relies on multiple systems instead of human annotators. In this paper, we investigate the practical usability of an SSC when a machine learning system is trained on it and tested on an unseen benchmark GSC. The main focus of this paper is how an SSC can be maximally exploited. In this process, we inspect several hypotheses which might have influenced the idea of SSC creation. Empirical results suggest that some of the hypotheses (e.g. a positive impact of a large SSC despite wrong and missing annotations) are not fully correct. We show that it is possible to automatically improve the quality and the quantity of the SSC annotations. We also observe that considering only those sentences of the SSC which contain annotations, rather than the full SSC, results in a performance boost.
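
The last observation in the abstract, training only on SSC sentences that carry at least one annotation, can be illustrated with a minimal sketch. This is not the authors' implementation: the `Sentence` class, the `keep_annotated` helper, the span format, and the two example sentences are all hypothetical and exist only to show the filtering step.

```python
from dataclasses import dataclass, field


@dataclass
class Sentence:
    """Hypothetical container for one SSC sentence (not from the paper)."""
    text: str
    # Assumed format: (start, end, label) character spans.
    # The list is empty when no system annotated the sentence.
    annotations: list = field(default_factory=list)


def keep_annotated(sentences):
    """Return only the sentences that carry at least one annotation."""
    return [s for s in sentences if s.annotations]


# Made-up example data, for illustration only.
ssc = [
    Sentence("BRCA1 is linked to breast cancer.", [(0, 5, "GENE")]),
    Sentence("The study ran for two years."),
]

# Training data would then be drawn from `filtered` instead of `ssc`.
filtered = keep_annotated(ssc)
```

In this sketch the second sentence is dropped because its annotation list is empty. Whether such filtering helps on a given corpus is an empirical question, which is what the paper evaluates.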