An improved hierarchical Bayesian model of language for document classification

Authors:
Ben Allison
Affiliations:
University of Sheffield, UK
Venue:
COLING '08 Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1
Year:
2008

Abstract

This paper addresses the fundamental problem of document classification, focusing on problems in which the classes are mutually exclusive. We advocate an approximate sampling distribution for word counts in documents and demonstrate that the resulting model outperforms both the simple multinomial and more recently proposed extensions on the classification task. We also compare the classifiers to a linear SVM and show that, provided certain conditions are met, the new model exceeds the SVM's performance and achieves results among the very best published on the Newsgroups classification task.
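
For orientation, the "simple multinomial" baseline mentioned in the abstract can be sketched as a multinomial naive Bayes classifier over word counts. This is only an illustrative sketch of that baseline, not the hierarchical model the paper proposes. The function names, the add-one smoothing parameter `alpha`, and the toy documents are all assumptions introduced here.

```python
import math
from collections import Counter

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Fit log priors and smoothed per-class log word probabilities.

    `alpha` is an assumed additive (Laplace) smoothing constant.
    """
    vocab = sorted({w for doc in docs for w in doc})
    class_counts = Counter(labels)
    word_counts = {c: Counter() for c in class_counts}
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_lik = {}
    for c, counts in word_counts.items():
        total = sum(counts.values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_lik

def classify(doc, log_prior, log_lik):
    """Return the single most probable class (classes are mutually exclusive)."""
    def score(c):
        # Words outside the training vocabulary are ignored.
        return log_prior[c] + sum(log_lik[c][w] for w in doc if w in log_lik[c])
    return max(log_prior, key=score)

# Hypothetical toy corpus, for illustration only.
train_docs = [["goal", "match", "team"], ["team", "win", "goal"],
              ["cpu", "memory", "disk"], ["disk", "cpu", "boot"]]
train_labels = ["sport", "sport", "comp", "comp"]

log_prior, log_lik = train_multinomial_nb(train_docs, train_labels)
prediction = classify(["team", "goal", "goal"], log_prior, log_lik)
```

Under the multinomial, each repeated occurrence of a word contributes the same log probability. The "more recently proposed extensions" in the abstract, and the paper's own model, are alternatives to that assumption about how word counts are distributed.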