Particle Swarm Optimization for clustering short-text corpora

Authors:
Diego Ingaramo;Marcelo Errecalde;Leticia Cagnina;Paolo Rosso
Affiliations:
Diego Ingaramo, Marcelo Errecalde, Leticia Cagnina: LIDIC, Facultad de Ciencias Físico Matemáticas y Naturales, Universidad Nacional de San Luis, Argentina. e-mail: {daingara, merreca, lcagnina}@unsl.edu.ar
Paolo Rosso: Natural Language Engineering Lab., Department of Information Systems and Computation, Universidad Politécnica de Valencia, Spain. e-mail: prosso@dsic.upv.es
Venue:
Proceedings of the 2009 conference on Computational Intelligence and Bioengineering: Essays in Memory of Antonina Starita
Year:
2009

Abstract

Clustering of short-text collections is a very relevant research area, given the current and likely future tendency of people to use “small-language” (e.g. blogs, snippets, news, and text-message generation such as email or chat). In recent years, a few approaches based on Particle Swarm Optimization (PSO) have been proposed to solve document clustering problems. However, the particularities that arise when this kind of approach is used for clustering corpora containing very short documents have received little attention from the computational linguistics community, perhaps because of the difficulty of the problem. In this work, we propose some variants of PSO methods to deal with this kind of corpora. Our proposal includes two very different approaches to the clustering problem, which differ essentially in the representation used to maintain information about the clusterings under consideration. As objective functions to be optimized, we use two unsupervised measures of cluster validity: the Expected Density Measure $\bar{\rho}$ and the Global Silhouette coefficient. In recent works on short-text clustering, these measures have shown an interesting level of correlation with the “true” categorizations provided by a human expert. The experimental results show that PSO-based approaches can be highly competitive alternatives for clustering short-text corpora and can, in some cases, outperform the most effective clustering algorithms used in this area.
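
To make the role of a validity measure as an optimization target more concrete, the following is a minimal sketch of a silhouette-based score over a candidate clustering. It uses the standard silhouette definition, s(i) = (b(i) - a(i)) / max(a(i), b(i)), and averages it over all documents; that averaging choice, the function name `global_silhouette`, and the toy distance matrix are assumptions made for illustration, and the abstract does not state how the authors compute the coefficient or encode clusterings in a particle.

```python
from typing import Dict, List, Sequence


def global_silhouette(dist: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean silhouette over all items (illustrative definition, an assumption).

    dist   -- symmetric matrix of pairwise distances between documents
    labels -- cluster label assigned to each document
    """
    # Group item indices by cluster label.
    clusters: Dict[int, List[int]] = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # by convention, a singleton contributes s(i) = 0
        # a: mean distance from i to the other members of its own cluster.
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance from i to the members of another cluster.
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(labels)


if __name__ == "__main__":
    # Toy data (hypothetical): four "documents" placed on a line.
    points = [0.0, 1.0, 10.0, 11.0]
    dist = [[abs(p - q) for q in points] for p in points]
    print(global_silhouette(dist, [0, 0, 1, 1]))  # compact, well-separated clusters
    print(global_silhouette(dist, [0, 1, 0, 1]))  # clusters that mix distant items
```

A search method such as PSO could use a score of this kind as its fitness function, preferring candidate clusterings with higher values; no ground-truth categories are needed to evaluate it.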