Semi-supervised learning for blog classification

Authors:
Daisuke Ikeda;Hiroya Takamura;Manabu Okumura
Affiliations:
Department of Computational Intelligence and Systems Science, Tokyo Institute of Technology, Yokohama, Kanagawa, Japan;Precision and Intelligence Laboratory, Tokyo Institute of Technology, Yokohama, Kanagawa, Japan;Precision and Intelligence Laboratory, Tokyo Institute of Technology, Yokohama, Kanagawa, Japan
Venue:
AAAI'08 Proceedings of the 23rd national conference on Artificial intelligence - Volume 2
Year:
2008

Abstract

Blog classification (e.g., identifying a blogger's gender or age) is one of the most interesting current problems in blog analysis. Although the problem is usually addressed with supervised learning, the large labeled dataset that training requires is not always available, whereas unlabeled blogs can easily be collected from the web. This paper therefore proposes a semi-supervised learning method for blog classification that makes effective use of unlabeled data. The method assumes that entries from the same blog share the same characteristics. Under this assumption, it captures the characteristics of each blog, such as writing style and topic, and uses them to improve classification accuracy.
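
The same-blog assumption can be illustrated with a small sketch. This is not the method proposed in the paper, whose details the abstract does not give; it is a simpler score-averaging scheme chosen only to show how blog membership can tie entries together. It assumes some entry-level classifier has already produced a real-valued score per entry (positive leaning toward class +1), and the names `blog_level_labels`, `propagate_to_entries`, and the sample blog ids are hypothetical.

```python
from collections import defaultdict


def blog_level_labels(entry_scores, threshold=0.0):
    """Average entry scores per blog and assign one label per blog.

    entry_scores: list of (blog_id, score) pairs, where score > threshold
    leans toward class +1. The scores are assumed to come from some
    entry-level classifier that is not shown here.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for blog_id, score in entry_scores:
        totals[blog_id] += score
        counts[blog_id] += 1
    return {
        blog_id: (1 if totals[blog_id] / counts[blog_id] > threshold else -1)
        for blog_id in totals
    }


def propagate_to_entries(entry_scores, threshold=0.0):
    """Give every entry the label of its blog (the same-blog assumption)."""
    labels = blog_level_labels(entry_scores, threshold)
    return [labels[blog_id] for blog_id, _ in entry_scores]


# Made-up scores for five entries drawn from two blogs.
scores = [
    ("blog_a", 0.9), ("blog_a", -0.2), ("blog_a", 0.4),
    ("blog_b", -0.7), ("blog_b", 0.1),
]
print(blog_level_labels(scores))
print(propagate_to_entries(scores))
```

In this sketch an entry whose own score points the wrong way (the second entry of `blog_a`) is overruled by the other entries of its blog, which is the kind of consistency the same-blog assumption is meant to supply.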