On the assessment of text corpora

  • Authors:
  • David Pinto; Paolo Rosso; Héctor Jiménez-Salazar

  • Affiliations:
  • Faculty of Computer Science, B. Autonomous University of Puebla, Mexico; Natural Language Engineering Lab. - ELiRF, Universidad Politécnica de Valencia, Spain; Department of Information Technologies, Autonomous Metropolitan University, Mexico

  • Venue:
  • NLDB'09 Proceedings of the 14th international conference on Applications of Natural Language to Information Systems
  • Year:
  • 2009

Abstract

Classifier-independent measures are important for assessing the quality of corpora. In this paper we present supervised and unsupervised measures and use them to analyse several data collections with respect to the following features: domain broadness, shortness, class imbalance, and stylometry. We found that the investigated assessment measures may make it possible to evaluate the quality of gold standards. Moreover, they could also help classification systems make strategic decisions when tackling specific text collections.