Evaluation of internal validity measures in short-text corpora

Authors:
Diego Ingaramo;David Pinto;Paolo Rosso;Marcelo Errecalde
Affiliations:
Development and Research Laboratory in Computational Intelligence, UNSL, Argentina;Natural Language Engineering Lab., Department of Information Systems and Computation, Polytechnic University of Valencia, Spain and Faculty of Computer Science, BUAP, Mexico;Natural Language Engineering Lab., Department of Information Systems and Computation, Polytechnic University of Valencia, Spain;Development and Research Laboratory in Computational Intelligence, UNSL, Argentina
Venue:
CICLing'08 Proceedings of the 9th international conference on Computational linguistics and intelligent text processing
Year:
2008

Abstract

Short-text clustering is one of the most difficult tasks in natural language processing because document terms occur with low frequency. We are interested in analysing this kind of corpus in order to develop novel techniques that may improve the results obtained by classical clustering algorithms. In this paper we present an evaluation of different internal clustering validity measures in order to determine their possible correlation with the F-Measure, a well-known external clustering measure used to assess the performance of clustering algorithms. We used several short-text corpora in the experiments. The correlation obtained with a particular set of internal validity measures lets us conclude that some of them may be used to improve the performance of text clustering algorithms.
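
The kind of comparison described in the abstract can be illustrated with a small sketch. It is not the authors' experimental setup. The abstract does not name the internal measures or the correlation statistic, so the silhouette coefficient, the Jaccard distance over term sets, and the Pearson correlation are all assumptions chosen for illustration, and the six toy "documents" and three candidate clusterings are hypothetical data. The sketch scores each candidate clustering with one internal measure (no gold labels needed) and with the external F-Measure (gold labels needed), then checks how the two scores co-vary.

```python
from collections import Counter
from math import sqrt


def jaccard_distance(a, b):
    """Distance between two non-empty term sets (assumed document model)."""
    return 1.0 - len(a & b) / len(a | b)


def f_measure(gold, pred):
    """External measure: for each gold class take the best F1 over all
    clusters, then average those values weighted by class size."""
    n = len(gold)
    class_sizes, cluster_sizes = Counter(gold), Counter(pred)
    overlap = Counter(zip(gold, pred))
    total = 0.0
    for c, n_c in class_sizes.items():
        best = 0.0
        for k, n_k in cluster_sizes.items():
            n_ck = overlap[(c, k)]
            if n_ck:
                p, r = n_ck / n_k, n_ck / n_c
                best = max(best, 2 * p * r / (p + r))
        total += (n_c / n) * best
    return total


def silhouette(dist, labels):
    """Internal measure (an assumed choice): mean silhouette coefficient
    computed from a full pairwise distance matrix."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical short texts, already reduced to term sets.
docs = [
    {"cluster", "validity", "measure"},
    {"cluster", "measure", "text"},
    {"validity", "cluster", "corpus"},
    {"particle", "swarm", "optimization"},
    {"swarm", "optimization", "search"},
    {"particle", "search", "swarm"},
]
gold = [0, 0, 0, 1, 1, 1]

# Hypothetical outputs of three clustering runs, from good to poor.
candidates = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [0, 1, 0, 1, 0, 1],
]

dist = [[jaccard_distance(a, b) for b in docs] for a in docs]
internal = [silhouette(dist, c) for c in candidates]
external = [f_measure(gold, c) for c in candidates]
r = pearson(internal, external)

for s, f in zip(internal, external):
    print(f"silhouette={s:.3f}  F-measure={f:.3f}")
print(f"Pearson correlation: {r:.3f}")
```

If an internal measure correlates well with the F-Measure across many runs and corpora, it can be used to rank or guide clusterings when no gold labels are available, which is the practical point the abstract makes. Three data points on a toy corpus are far too few to support any such conclusion; the sketch only shows the mechanics.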