Particle Swarm Optimization for clustering short-text corpora

  • Authors:
  • Diego Ingaramo; Marcelo Errecalde; Leticia Cagnina; Paolo Rosso

  • Affiliations:
  • LIDIC, Facultad de Ciencias Físico Matemáticas y Naturales, Universidad Nacional de San Luis, Argentina. e-mail: {daingara, merreca, lcagnina}@unsl.edu.ar; Natural Language Engineering Lab., Department of Information Systems and Computation, Universidad Politécnica de Valencia, Spain. e-mail: prosso@dsic.upv.es

  • Venue:
  • Proceedings of the 2009 conference on Computational Intelligence and Bioengineering: Essays in Memory of Antonina Starita
  • Year:
  • 2009

Abstract

Clustering short-text collections is a highly relevant research area, given the current and likely future tendency of people to communicate in “small-language” (e.g. blogs, snippets, news, and text messages such as email or chat). In recent years, a few approaches based on Particle Swarm Optimization (PSO) have been proposed to solve document clustering problems. However, the particularities that arise when such approaches are used to cluster corpora of very short documents have received little attention from the computational linguistics community, perhaps because of how challenging the problem is. In this work, we propose some variants of PSO methods to deal with this kind of corpus. Our proposal includes two very different approaches to the clustering problem, which differ essentially in the representation used to maintain information about the clusterings under consideration. As objective functions to be optimized, we used two unsupervised measures of cluster validity: the Expected Density Measure $\bar{\rho}$ and the Global Silhouette coefficient. In recent work on short-text clustering, these measures have shown an interesting level of correlation with the “true” categorizations provided by a human expert. The experimental results show that PSO-based approaches can be highly competitive alternatives for clustering short-text corpora and can, in some cases, outperform the most effective clustering algorithms used in this area.
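
To make one of the validity measures named in the abstract concrete, the following is a minimal sketch of a global silhouette score for a fixed clustering. It rests on assumptions not stated in the abstract: it uses a common formulation (the silhouette width of each item is (b - a) / max(a, b), averaged within each cluster and then across clusters), it measures dissimilarity with Euclidean distance rather than a text-specific measure such as cosine distance, and the function name global_silhouette is hypothetical. It is not the authors' implementation.

```python
from math import dist  # Euclidean distance, available since Python 3.8


def global_silhouette(points, labels):
    """Mean over clusters of each cluster's average silhouette width.

    Assumed formulation: s(i) = (b - a) / max(a, b), where a is the mean
    distance from item i to the other items of its own cluster and b is
    the smallest mean distance from i to the items of any other cluster.
    """
    # Group item indices by cluster label.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    def mean_dist(i, members):
        others = [j for j in members if j != i]
        return sum(dist(points[i], points[j]) for j in others) / len(others)

    cluster_scores = []
    for lab, members in clusters.items():
        widths = []
        for i in members:
            if len(members) == 1:
                widths.append(0.0)  # usual convention for singleton clusters
                continue
            a = mean_dist(i, members)
            b = min(mean_dist(i, other)
                    for other_lab, other in clusters.items()
                    if other_lab != lab)
            denom = max(a, b)
            widths.append((b - a) / denom if denom > 0 else 0.0)
        cluster_scores.append(sum(widths) / len(widths))
    return sum(cluster_scores) / len(cluster_scores)


# Toy data: two tight pairs of points that lie far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = global_silhouette(points, [0, 0, 1, 1])  # groups the nearby points
bad = global_silhouette(points, [0, 1, 0, 1])   # splits the nearby points
```

A score near 1 indicates compact, well-separated clusters, while a negative score indicates items that sit closer to another cluster than to their own. In a PSO setting, a function of this kind could serve as the fitness value assigned to the clustering that each particle encodes.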