Topic classification of blog posts using distant supervision

  • Authors: Stephanie D. Husby; Denilson Barbosa
  • Affiliations: University of Alberta; University of Alberta
  • Venue: Proceedings of the Workshop on Semantic Analysis in Social Media
  • Year: 2012

Abstract

Classifying blog posts by topic is useful for applications such as search and marketing. However, topic classification is time-consuming and error-prone, especially in an open domain such as the blogosphere. State-of-the-art approaches rely on supervised methods, which require considerable training effort and use the whole corpus vocabulary as features, demanding considerable memory. We present an effective alternative in which distant supervision is used to obtain training data: we use Wikipedia articles labelled with Freebase domains. We reduce the memory requirements by using only named entities as features. We test our classifier on a sample of blog posts and report accuracy of up to 0.69 for multi-class labelling and 0.9 for binary classification.
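
The idea in the abstract can be illustrated with a minimal sketch. It assumes that named entities have already been extracted from each document (the extraction step is not shown), and it uses a multinomial Naive Bayes classifier with add-one smoothing, which is an assumption made here for illustration and not necessarily the classifier used by the authors. The function names, the domain labels, and the toy training examples are all hypothetical; they stand in for Wikipedia articles labelled with Freebase domains.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """Build per-domain entity counts from (entities, domain) pairs.

    In the distant-supervision setting, each pair would come from a
    Wikipedia article and the Freebase domain attached to it.
    """
    entity_counts = defaultdict(Counter)  # domain -> entity -> count
    doc_counts = Counter()                # domain -> number of documents
    vocab = set()                         # all entities seen in training
    for entities, domain in labelled_docs:
        doc_counts[domain] += 1
        entity_counts[domain].update(entities)
        vocab.update(entities)
    return entity_counts, doc_counts, vocab


def classify(entities, model):
    """Return the most probable domain for a list of named entities."""
    entity_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_domain, best_score = None, float("-inf")
    for domain in sorted(doc_counts):
        # Log prior plus add-one smoothed log likelihoods.
        score = math.log(doc_counts[domain] / total_docs)
        denom = sum(entity_counts[domain].values()) + len(vocab)
        for entity in entities:
            if entity in vocab:  # entities unseen in training are ignored
                score += math.log((entity_counts[domain][entity] + 1) / denom)
        if score > best_score:
            best_domain, best_score = domain, score
    return best_domain


# Hypothetical toy training data (illustration only, not from the paper).
wiki_training = [
    (["Wayne Gretzky", "Edmonton Oilers", "NHL"], "sports"),
    (["Stanley Cup", "NHL", "Montreal Canadiens"], "sports"),
    (["The Beatles", "Abbey Road", "John Lennon"], "music"),
    (["Bob Dylan", "Grammy Award", "John Lennon"], "music"),
]
model = train(wiki_training)

# Entities assumed to have been extracted from a blog post.
blog_post_entities = ["NHL", "Edmonton Oilers", "Bob Dylan"]
print(classify(blog_post_entities, model))
```

Because the feature space contains only named entities, the model stores one count per (domain, entity) pair instead of one per vocabulary word, which is the memory saving the abstract refers to.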