Integrated instance- and class-based generative modeling for text classification

Authors:
Antti Puurula;Sung-Hyon Myaeng
Affiliations:
The University of Waikato, Hamilton, New Zealand;KAIST, Daejeon, South Korea
Venue:
Proceedings of the 18th Australasian Document Computing Symposium
Year:
2013

Citing 19
Cited 0

Expert network: effective and efficient learning from human decisions in text categorization and retrieval

SIGIR '94 Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval
Automatic Indexing: An Experimental Inquiry

Journal of the ACM (JACM)
Text classification in a hierarchical mixture model for small training sets

Proceedings of the tenth international conference on Information and knowledge management
Novelty and redundancy detection in adaptive filtering

SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Text Categorization with Suport Vector Machines: Learning with Many Relevant Features

ECML '98 Proceedings of the 10th European Conference on Machine Learning
RCV1: A New Benchmark Collection for Text Categorization Research

The Journal of Machine Learning Research
Cluster-based retrieval using language models

Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
LIBLINEAR: A Library for Large Linear Classification

The Journal of Machine Learning Research
ECUE: A Spam Filter that Uses Machine Learning to Track Concept Drift

Proceedings of the 2006 conference on ECAI 2006: 17th European Conference on Artificial Intelligence August 29 -- September 1, 2006, Riva del Garda, Italy
Combining instance-based learning and logistic regression for multilabel classification

Machine Learning
Combining naive bayes and n-gram language models for text classification

ECIR'03 Proceedings of the 25th European conference on IR research
Learning word vectors for sentiment analysis

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
Estimating continuous distributions in Bayesian classifiers

UAI'95 Proceedings of the Eleventh conference on Uncertainty in artificial intelligence
Probabilistic topic models

Communications of the ACM
Efficient multilabel classification algorithms for large-scale problems in the legal domain

Semantic Processing of Legal Texts
Nearest neighbor pattern classification

IEEE Transactions on Information Theory
The estimation of the gradient of a density function, with applications in pattern recognition

IEEE Transactions on Information Theory
Ensemble of exemplar-SVMs for object detection and beyond

ICCV '11 Proceedings of the 2011 International Conference on Computer Vision
Sentiment classification with supervised sequence embedding

ECML PKDD'12 Proceedings of the 2012 European conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I

Quantified Score

Hi-index	0.00

Visualization

Abstract

Statistical methods for text classification are predominantly based on the paradigm of class-based learning that associates class variables with features, discarding the instances of data after model training. This results in efficient models, but neglects the fine-grained information present in individual documents. Instance-based learning uses this information, but suffers from data sparsity with text data. In this paper, we propose a generative model called Tied Document Mixture (TDM) for extending Multinomial Naive Bayes (MNB) with mixtures of hierarchically smoothed models for documents. Alternatively, TDM can be viewed as a Kernel Density Classifier using class-smoothed Multinomial kernels. TDM is evaluated for classification accuracy on 14 different datasets for multi-label, multi-class and binary-class text classification tasks and compared to instance- and class-based learning baselines. The comparisons to MNB demonstrate a substantial improvement in accuracy as a function of available training documents per class, ranging up to average error reductions of over 26% in sentiment classification and 65% in spam classification. On average TDM is as accurate as the best discriminative classifiers, but retains the linear time complexities of instance-based learning methods, with exact algorithms for both model estimation and inference.