Clustering weblogs on the basis of a topic detection method

Authors:
Fernando Perez-Tellez;David Pinto;John Cardiff;Paolo Rosso
Affiliations:
Institute of Technology Tallaght Dublin, Ireland;Benemérita Universidad Autónoma de Puebla, Mexico;Institute of Technology Tallaght Dublin, Ireland;Natural Language Engineering Lab, ELiRF, Universidad Politécnica de Valencia, Spain
Venue:
MCPR'10 Proceedings of the 2nd Mexican conference on Pattern recognition: Advances in pattern recognition
Year:
2010

Abstract

In recent years, the volume of information published on weblog sites has grown enormously, alongside new web technologies through which people discuss current events. The need for automatic tools to organize this massive amount of information is clear, but particular characteristics of weblogs, such as short posts and overlapping vocabulary, make the task difficult. In this work, we present a novel methodology for clustering weblog posts according to the topics they discuss. The methodology combines a generative probabilistic model with a Self-Term Expansion methodology. Our results demonstrate a considerable improvement over the baseline.