Incorporating non-local information into information extraction systems by Gibbs sampling
ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics
Proceedings of the 17th international conference on World Wide Web
Personal vs non-personal blogs: initial classification experiments
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Introduction to Information Retrieval
Introduction to Information Retrieval
Blog categorization exploiting domain dictionary and dynamically estimated domains of unknown words
HLT-Short '08 Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers
AAAI'06 proceedings of the 21st national conference on Artificial intelligence - Volume 2
Semi-supervised learning for blog classification
AAAI'08 Proceedings of the 23rd national conference on Artificial intelligence - Volume 2
The WEKA data mining software: an update
ACM SIGKDD Explorations Newsletter
Distant supervision for relation extraction without labeled data
ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - Volume 2
Estimating continuous distributions in Bayesian classifiers
UAI'95 Proceedings of the Eleventh conference on Uncertainty in artificial intelligence
MICAI'12 Proceedings of the 11th Mexican international conference on Advances in Artificial Intelligence - Volume Part I
Harnessing web page directories for large-scale classification of tweets
Proceedings of the 22nd international conference on World Wide Web companion
Hi-index | 0.00 |
Classifying blog posts by topics is useful for applications such as search and marketing. However, topic classification is time consuming and error prone, especially in an open domain such as the blogosphere. The state-of-the-art relies on supervised methods, requiring considerable training effort, that use the whole corpus vocabulary as features, demanding considerable memory to process. We show an effective alternative whereby distant supervision is used to obtain training data: we use Wikipedia articles labelled with Freebase domains. We address the memory requirements by using only named entities as features. We test our classifier on a sample of blog posts, and report up to 0.69 accuracy for multi-class labelling and 0.9 for binary classification.