Automated learning of RVM for large scale text sets: divide to conquer

Authors:
Catarina Silva;Bernardete Ribeiro
Affiliations:
School of Technology and Management of the Polytechnic Institute of Leiria, Morro do Lena – Alto do Vieiro, Portugal, Leiria, Portugal;Department of Informatics Engineering, Center for Informatics and Systems (CISUC), University of Coimbra, Coimbra, Portugal
Venue:
IDEAL'06 Proceedings of the 7th international conference on Intelligent Data Engineering and Automated Learning
Year:
2006


Abstract

Three methods are investigated for automated learning of Relevance Vector Machines (RVM) on large-scale text sets. The probabilistic Bayesian nature of RVM allows both predictive distributions on test instances and model-based selection, yielding a parsimonious solution. However, scaling up the algorithm is not workable in most digital information processing applications. We examine the properties of the baseline RVM algorithm and propose new scaling approaches based on choosing appropriate working sets that retain the most informative data. Incremental, ensemble and boosting algorithms are deployed to improve classification performance by taking advantage of the large training set available. Results on Reuters-21578 are presented, showing performance gains while maintaining sparse solutions that can be deployed in distributed environments.
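
The divide-to-conquer idea behind the ensemble approach can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give algorithmic details, so the chunking scheme, the majority vote, and all function names below are assumptions made for illustration. The base learner is a simple nearest-centroid classifier standing in for an RVM, which the Python standard library does not provide.

```python
from collections import Counter


def split_into_chunks(samples, n_chunks):
    """Divide the training set into n_chunks working sets (round-robin; an assumption)."""
    return [samples[i::n_chunks] for i in range(n_chunks)]


def train_centroid(chunk):
    """Stand-in base learner (NOT an RVM): compute one mean vector per class."""
    sums, counts = {}, Counter()
    for features, label in chunk:
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, value in enumerate(features):
            acc[j] += value
        counts[label] += 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict_centroid(model, features):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: dist(model[label]))


def ensemble_predict(models, features):
    """Combine the per-chunk models by majority vote."""
    votes = Counter(predict_centroid(m, features) for m in models)
    return votes.most_common(1)[0][0]


# Toy two-feature data (hypothetical; real inputs would be document term vectors).
train = [
    ([0.0, 0.1], "neg"), ([0.2, 0.0], "neg"), ([0.1, 0.2], "neg"),
    ([1.0, 0.9], "pos"), ([0.9, 1.1], "pos"), ([1.1, 1.0], "pos"),
]

# Train one small model per working set instead of one model on all the data.
models = [train_centroid(chunk) for chunk in split_into_chunks(train, 3)]
print(ensemble_predict(models, [0.05, 0.05]))
print(ensemble_predict(models, [1.0, 1.0]))
```

Each base model sees only its own working set, so training cost per model depends on the chunk size rather than on the full collection, and the models can be trained independently, for example on separate machines.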