Clustering Narrow-Domain Short Texts by Using the Kullback-Leibler Distance

Authors:
David Pinto;José-Miguel Benedí;Paolo Rosso
Affiliations:
Department of Information Systems and Computation, UPV, Valencia 46022, Camino de Vera s/n, Spain and Faculty of Computer Science, BUAP, Puebla 72570, Ciudad Universitaria, Mexico;Department of Information Systems and Computation, UPV, Valencia 46022, Camino de Vera s/n, Spain;Department of Information Systems and Computation, UPV, Valencia 46022, Camino de Vera s/n, Spain
Venue:
CICLing '07 Proceedings of the 8th International Conference on Computational Linguistics and Intelligent Text Processing
Year:
2009



Abstract

Clustering short texts is a difficult task in itself, and restricting them to a narrow domain poses an additional challenge for current clustering methods. We address this problem with a new measure of distance between documents based on the symmetric Kullback-Leibler distance. Although this measure is commonly used to calculate the distance between two probability distributions, we have adapted it to obtain a distance value between two documents. We carried out experiments over two different narrow-domain corpora, and our findings indicate that this measure can be used for the addressed problem, obtaining results comparable to those obtained with the Jaccard similarity measure.
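
To make the idea of a document-to-document Kullback-Leibler distance concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact formula or smoothing scheme, so the sketch assumes one common symmetric form, the sum over terms of (P(t) - Q(t)) * log(P(t) / Q(t)), and a simple epsilon back-off for terms missing from a document. The function names and the `epsilon` parameter are hypothetical.

```python
import math
from collections import Counter


def term_distribution(tokens, vocabulary, epsilon=1e-3):
    """Estimate a smoothed term distribution for one document.

    Terms of the shared vocabulary that do not occur in the document get
    the small probability `epsilon` (an assumed smoothing scheme), and the
    result is renormalized so that it sums to 1.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    raw = {
        term: (counts[term] / total if counts[term] > 0 else epsilon)
        for term in vocabulary
    }
    norm = sum(raw.values())
    return {term: prob / norm for term, prob in raw.items()}


def symmetric_kl(p, q):
    """Symmetric Kullback-Leibler distance between two distributions
    defined over the same set of terms."""
    return sum((p[t] - q[t]) * math.log(p[t] / q[t]) for t in p)


def document_distance(tokens_a, tokens_b, epsilon=1e-3):
    """Distance between two tokenized documents.

    Both documents are mapped to distributions over the union of their
    vocabularies, so every term has a non-zero probability in each.
    """
    vocabulary = set(tokens_a) | set(tokens_b)
    p = term_distribution(tokens_a, vocabulary, epsilon)
    q = term_distribution(tokens_b, vocabulary, epsilon)
    return symmetric_kl(p, q)


doc_a = "kullback leibler distance clustering".split()
doc_b = "kullback leibler divergence clustering".split()
doc_c = "particle swarm optimization".split()

# Documents that share most terms are closer than documents that share none.
print(document_distance(doc_a, doc_b) < document_distance(doc_a, doc_c))
```

Smoothing is what makes the measure usable on short texts: without it, any term that appears in only one of the two documents would produce a zero probability and an undefined logarithm. The resulting pairwise distances could then be passed to any clustering method that accepts a distance matrix.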