Topic classification of blog posts using distant supervision

Authors: Stephanie D. Husby; Denilson Barbosa
Affiliations: University of Alberta; University of Alberta
Venue: Proceedings of the Workshop on Semantic Analysis in Social Media
Year: 2012

Abstract

Classifying blog posts by topic is useful for applications such as search and marketing. However, topic classification is time-consuming and error-prone, especially in an open domain such as the blogosphere. The state of the art relies on supervised methods, which require considerable training effort and use the whole corpus vocabulary as features, and therefore demand considerable memory. We show an effective alternative in which distant supervision is used to obtain training data: we use Wikipedia articles labelled with Freebase domains. We address the memory requirements by using only named entities as features. We test our classifier on a sample of blog posts, and report up to 0.69 accuracy for multi-class labelling and 0.9 for binary classification.
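
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which classifier or entity recogniser was used, so a multinomial Naive Bayes with add-one smoothing and a tiny hand-written gazetteer are assumed here purely for illustration. The training sentences stand in for Wikipedia articles labelled with Freebase domains, and every name in the code (`DISTANT_TRAINING_DATA`, `GAZETTEER`, `extract_entities`, `train`, `classify`) is hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stand-in for Wikipedia articles labelled with Freebase domains.
DISTANT_TRAINING_DATA = [
    ("sports", "Wayne Gretzky played for the Edmonton Oilers."),
    ("sports", "The Edmonton Oilers won the Stanley Cup."),
    ("music", "The Beatles recorded at Abbey Road with George Martin."),
    ("music", "George Martin produced albums for The Beatles."),
]

# Hypothetical gazetteer; a real system would use a named-entity recogniser.
GAZETTEER = [
    "Wayne Gretzky", "Edmonton Oilers", "Stanley Cup",
    "The Beatles", "Abbey Road", "George Martin",
]


def extract_entities(text):
    """Return the gazetteer entries that occur in the text."""
    return [name for name in GAZETTEER if name in text]


def train(labelled_docs):
    """Count documents and entity occurrences per label."""
    doc_counts = Counter()
    entity_counts = defaultdict(Counter)
    vocabulary = set()
    for label, text in labelled_docs:
        doc_counts[label] += 1
        for entity in extract_entities(text):
            entity_counts[label][entity] += 1
            vocabulary.add(entity)
    return doc_counts, entity_counts, vocabulary


def classify(text, model):
    """Pick the label with the highest smoothed Naive Bayes log-score."""
    doc_counts, entity_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    # Only entities seen during training contribute to the score.
    entities = [e for e in extract_entities(text) if e in vocabulary]
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        score = math.log(doc_counts[label] / total_docs)
        label_total = sum(entity_counts[label].values())
        for entity in entities:
            count = entity_counts[label][entity]
            # Add-one (Laplace) smoothing over the entity vocabulary.
            score += math.log((count + 1) / (label_total + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(DISTANT_TRAINING_DATA)
blog_post = "Watched the Edmonton Oilers last night and thought of Wayne Gretzky."
print(classify(blog_post, model))
```

Because the feature space contains only entity names rather than every word in the corpus, the count tables stay small, which is the memory argument the abstract makes. The blog post itself is never hand-labelled; labels come only from the externally labelled articles.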