Characterizing weblog corpora

Authors:
Fernando Perez-Tellez;David Pinto;John Cardiff;Paolo Rosso
Affiliations:
Social Media Research Group, Institute of Technology Tallaght, Dublin, Ireland;Benemérita Universidad Autónoma de Puebla, Mexico;Social Media Research Group, Institute of Technology Tallaght, Dublin, Ireland;Natural Language Engineering Lab. – EliRF, Dept. Sistemas Informáticos y Computación, Universidad Politécnica Valencia, Spain
Venue:
NLDB'09 Proceedings of the 14th international conference on Applications of Natural Language to Information Systems
Year:
2009

Citing 1
Cited 2

UPV-SI: word sense induction using self term expansion

SemEval '07 Proceedings of the 4th International Workshop on Semantic Evaluations

Improving the Clustering of Blogosphere with a Self-term Enriching Technique

TSD '09 Proceedings of the 12th International Conference on Text, Speech and Dialogue

Clustering weblogs on the basis of a topic detection method

MCPR'10 Proceedings of the 2nd Mexican conference on Pattern recognition: Advances in pattern recognition

Abstract

To exploit the huge volume of information published in the blogosphere, it is essential to provide techniques such as clustering that can automatically analyze and classify its contents. However, these techniques typically produce better results when dealing with wide-domain, full-text documents. In most cases, blogs can instead be considered “short texts”: they are not extensive documents, and they exhibit characteristics that are undesirable from a clustering perspective, such as low-frequency terms, small vocabulary size, and vocabulary overlap across some domains. Furthermore, their characteristics vary widely depending on the specific interests of the writer, the writer's linguistic style, and the volume of text the writer produces.
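
The corpus characteristics named in the abstract (low-frequency terms, vocabulary size, and vocabulary overlap between domains) can be made concrete with a small sketch. This is an illustration only, not the measures defined in the paper: the function name `corpus_profile`, the whitespace tokenizer, the `low_freq_threshold` parameter, and the use of Jaccard similarity for overlap are all assumptions introduced here.

```python
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase whitespace tokenizer; a stand-in for real preprocessing."""
    return text.lower().split()


def corpus_profile(docs_by_domain, low_freq_threshold=1):
    """Compute simple corpus-level measures (hypothetical helper).

    docs_by_domain maps a domain label to a list of document strings.
    """
    term_counts = Counter()   # corpus-wide term frequencies
    doc_lengths = []          # number of tokens per document
    vocab_by_domain = {}      # distinct terms seen in each domain

    for domain, docs in docs_by_domain.items():
        domain_vocab = set()
        for doc in docs:
            tokens = tokenize(doc)
            term_counts.update(tokens)
            doc_lengths.append(len(tokens))
            domain_vocab.update(tokens)
        vocab_by_domain[domain] = domain_vocab

    vocab_size = len(term_counts)
    # Terms occurring at most `low_freq_threshold` times in the whole corpus.
    low_freq = sum(1 for c in term_counts.values() if c <= low_freq_threshold)

    # Assumed overlap measure: Jaccard similarity between domain vocabularies.
    overlap = {}
    for a, b in combinations(sorted(vocab_by_domain), 2):
        union = vocab_by_domain[a] | vocab_by_domain[b]
        shared = vocab_by_domain[a] & vocab_by_domain[b]
        overlap[(a, b)] = len(shared) / len(union) if union else 0.0

    return {
        "vocabulary_size": vocab_size,
        "avg_doc_length": sum(doc_lengths) / len(doc_lengths) if doc_lengths else 0.0,
        "low_freq_ratio": low_freq / vocab_size if vocab_size else 0.0,
        "domain_overlap": overlap,
    }


if __name__ == "__main__":
    # Toy data, invented for illustration.
    toy = {
        "sports": ["great match today", "match report"],
        "tech": ["new phone today"],
    }
    print(corpus_profile(toy))
```

On a short-text corpus, one would expect a small `vocabulary_size`, a high `low_freq_ratio`, and, for closely related domains, a high `domain_overlap`; these are the conditions the abstract describes as unfavourable for clustering.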