Clustering weblogs on the basis of a topic detection method

  • Authors:
  • Fernando Perez-Tellez;David Pinto;John Cardiff;Paolo Rosso

  • Affiliations:
  • Institute of Technology Tallaght Dublin, Ireland;Benemérita Universidad Autónoma de Puebla, Mexico;Institute of Technology Tallaght Dublin, Ireland;Natural Language Engineering Lab, ELiRF, Universidad Politécnica de Valencia, Spain

  • Venue:
  • MCPR'10 Proceedings of the 2nd Mexican conference on Pattern recognition: Advances in pattern recognition
  • Year:
  • 2010

Abstract

In recent years, the volume of information published on weblog sites has increased vastly, and new web technologies have emerged through which people discuss current events. The need for automatic tools to organize this massive amount of information is clear, but particular characteristics of weblogs, such as short posts and overlapping vocabulary, make the task difficult. In this work, we present a novel methodology for clustering weblog posts according to the topics they discuss. The methodology combines a generative probabilistic model with a Self-Term Expansion methodology. Our results demonstrate a considerable improvement over the baseline.