A Self-enriching Methodology for Clustering Narrow Domain Short Texts

Authors:
David Pinto;Paolo Rosso;Héctor Jiménez-Salazar
Venue:
The Computer Journal
Year:
2011

Abstract

Clustering narrow domain short texts is considered a complex task because of two intrinsic features of the corpus to be clustered: (i) the low frequency of vocabulary terms in short texts, and (ii) the high vocabulary overlap associated with narrow domains. The aim of this paper is to introduce a self-term expansion methodology that improves the performance of clustering methods on corpora of this kind. The methodology enriches raw textual data by adding co-related terms from a lexical knowledge resource that is constructed automatically from the target data set itself, not from an external resource. We also propose a set of supervised and unsupervised text assessment measures for evaluating corpus features such as shortness, stylometry and domain broadness. These measures help to determine beforehand whether the proposed methodology should be applied to a given corpus. Finally, we integrate all of these assessment measures in a freely available web-based system named Watermarking Corpora On-line System, which computer scientists may use to evaluate the features of a given textual corpus.
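
The self-term expansion idea described above can be illustrated with a minimal sketch. The abstract does not say how co-related terms are scored, so the sketch assumes document-level co-occurrence scored with pointwise mutual information (PMI). The function names build_cooccurrence_resource and expand, and the thresholds min_pmi and min_pair_count, are hypothetical names chosen for illustration and are not taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def build_cooccurrence_resource(docs, min_pmi=1.0, min_pair_count=2):
    """Map each term to the terms it co-occurs with strongly in `docs`.

    `docs` is a list of token lists. The resource is built only from the
    corpus itself. Scoring by PMI is an assumption of this sketch.
    """
    n_docs = len(docs)
    term_df = Counter()  # number of documents containing each term
    pair_df = Counter()  # number of documents containing each term pair
    for tokens in docs:
        vocab = sorted(set(tokens))
        term_df.update(vocab)
        pair_df.update(combinations(vocab, 2))

    related = {}
    for (a, b), n_ab in pair_df.items():
        if n_ab < min_pair_count:
            continue  # ignore pairs seen too rarely to be reliable
        pmi = math.log2(n_ab * n_docs / (term_df[a] * term_df[b]))
        if pmi >= min_pmi:
            related.setdefault(a, set()).add(b)
            related.setdefault(b, set()).add(a)
    return related


def expand(tokens, related):
    """Append the co-related terms of every token that are not already present."""
    extra = set()
    for term in tokens:
        extra |= related.get(term, set())
    return list(tokens) + sorted(extra - set(tokens))


corpus = [
    ["gene", "protein", "binding"],
    ["gene", "protein", "expression"],
    ["cluster", "text", "short"],
    ["cluster", "text", "corpus"],
]
resource = build_cooccurrence_resource(corpus)
expanded = [expand(doc, resource) for doc in corpus]
```

In this toy corpus only the pairs (gene, protein) and (cluster, text) pass both thresholds, so a text that mentions "gene" without "protein" gains "protein" after expansion. The expanded token lists would then be passed to whichever clustering method is in use.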