Classifier-independent measures are important for assessing the quality of corpora. In this paper we present supervised and unsupervised measures for analysing several data collections with respect to the following features: domain broadness, shortness, class imbalance, and stylometry. We found that the investigated assessment measures may make it possible to evaluate the quality of gold standards. Moreover, they could also help classification systems make strategic decisions when tackling specific text collections.