On the assessment of text corpora

Authors:
David Pinto; Paolo Rosso; Héctor Jiménez-Salazar
Affiliations:
Faculty of Computer Science, B. Autonomous University of Puebla, Mexico; Natural Language Engineering Lab. - ELiRF, Universidad Politécnica de Valencia, Spain; Department of Information Technologies, Autonomous Metropolitan University, Mexico
Venue:
NLDB'09 Proceedings of the 14th international conference on Applications of Natural Language to Information Systems
Year:
2009

Abstract

Classifier-independent measures are important for assessing the quality of corpora. In this paper we present supervised and unsupervised measures for analysing several data collections with respect to the following features: domain broadness, shortness, class imbalance, and stylometry. We found that the investigated measures may make it possible to evaluate the quality of gold standards. Moreover, they could also help classification systems make strategic decisions when tackling specific text collections.
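
To give a rough idea of what a classifier-independent measure can look like, the following minimal Python sketch computes two simple corpus statistics: one for shortness and one for class imbalance. The function names and both formulas (mean token count, and root-mean-square deviation of class proportions from a uniform split) are assumptions chosen for illustration; the abstract does not state the definitions used in the paper.

```python
from collections import Counter
from math import sqrt


def shortness(docs):
    """Mean document length in whitespace-separated tokens.

    Illustrative assumption: lower values indicate a 'shorter' corpus.
    """
    lengths = [len(doc.split()) for doc in docs]
    return sum(lengths) / len(lengths)


def class_imbalance(labels):
    """Root-mean-square deviation of class proportions from a uniform split.

    Illustrative assumption: 0.0 means perfectly balanced classes;
    larger values mean a more skewed gold standard.
    """
    counts = Counter(labels)
    n_docs, n_classes = len(labels), len(counts)
    expected = 1.0 / n_classes
    squared = sum((c / n_docs - expected) ** 2 for c in counts.values())
    return sqrt(squared / n_classes)


if __name__ == "__main__":
    docs = ["short text", "another slightly longer text", "tiny"]
    labels = ["sports", "sports", "politics"]
    print(shortness(docs))
    print(class_imbalance(labels))
```

Neither function needs a trained classifier: shortness uses only the raw texts (an unsupervised measure), while class imbalance uses only the gold-standard labels (a supervised one).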