Augmenting Naive Bayes Classifiers with Statistical Language Models

Authors:
Fuchun Peng; Dale Schuurmans; Shaojun Wang
Affiliations:
Center for Intelligent Information Retrieval, Department of Computer Science, University of Massachusetts at Amherst, 140 Governors Drive, Amherst, MA, USA 01003. fuchun@cs.umass.edu; Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada, T6G 2E8. dale@cs.ualberta.ca, swang@cs.ualberta.ca
Venue:
Information Retrieval
Year:
2004



Abstract

We augment naive Bayes models with statistical n-gram language models to address shortcomings of the standard naive Bayes text classifier. The result is a generalized naive Bayes classifier that allows a local Markov dependence among observations, a model we refer to as the Chain Augmented Naive Bayes (CAN) classifier. CAN models have two advantages over standard naive Bayes classifiers. First, they relax some of the independence assumptions of naive Bayes, allowing a local Markov chain dependence in the observed variables, while still permitting efficient inference and learning. Second, they permit straightforward application of sophisticated smoothing techniques from statistical language modeling, which allows one to obtain better parameter estimates than the standard Laplace smoothing used in naive Bayes classification. In this paper, we introduce CAN models and apply them to various text classification problems. To demonstrate the language-independent and task-independent nature of these classifiers, we present experimental results on several text classification problems (authorship attribution, text genre classification, and topic detection) in several languages (Greek, English, Japanese, and Chinese). We then systematically study the key factors in the CAN model that can influence classification performance, and analyze the strengths and weaknesses of the model.
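
To make the idea in the abstract concrete, the following is a minimal sketch of a chain-augmented classifier, not the authors' implementation. It assumes character-level bigram models (one per class), a uniform treatment of unseen characters, and simple linear interpolation between bigram and add-one unigram estimates as a stand-in for the more sophisticated smoothing techniques the abstract mentions. The names `BigramClassModel`, `CANClassifier`, and the parameter `lam` are hypothetical. A document is assigned to the class c that maximizes log P(c) plus the sum over positions of log P(x_i | x_{i-1}, c).

```python
import math
from collections import Counter, defaultdict


class BigramClassModel:
    """Character bigram language model for a single class (illustrative only)."""

    def __init__(self, lam=0.7):
        self.lam = lam              # interpolation weight on the bigram estimate (assumed value)
        self.bigrams = Counter()    # counts of (previous char, current char)
        self.context = Counter()    # counts of each char used as a context
        self.unigrams = Counter()   # counts of each char
        self.total = 0
        self.vocab_size = 0

    def fit(self, docs, vocab):
        self.vocab_size = len(vocab)
        for doc in docs:
            prev = "<s>"            # start-of-document marker
            for ch in doc:
                self.bigrams[(prev, ch)] += 1
                self.context[prev] += 1
                self.unigrams[ch] += 1
                self.total += 1
                prev = ch
        return self

    def log_prob(self, doc):
        """Log-likelihood of doc under a first-order Markov chain."""
        lp = 0.0
        prev = "<s>"
        for ch in doc:
            # Add-one unigram estimate, with one extra slot reserved for unseen characters.
            p_uni = (self.unigrams[ch] + 1) / (self.total + self.vocab_size + 1)
            seen = self.context[prev]
            p_bi = self.bigrams[(prev, ch)] / seen if seen else 0.0
            lp += math.log(self.lam * p_bi + (1 - self.lam) * p_uni)
            prev = ch
        return lp


class CANClassifier:
    """Class prior plus one bigram language model per class."""

    def __init__(self, lam=0.7):
        self.lam = lam

    def fit(self, docs, labels):
        vocab = {ch for doc in docs for ch in doc}
        by_class = defaultdict(list)
        for doc, label in zip(docs, labels):
            by_class[label].append(doc)
        n = len(docs)
        self.log_prior = {c: math.log(len(ds) / n) for c, ds in by_class.items()}
        self.models = {c: BigramClassModel(self.lam).fit(ds, vocab)
                       for c, ds in by_class.items()}
        return self

    def predict(self, doc):
        # Choose the class with the highest posterior score in log space.
        return max(self.models,
                   key=lambda c: self.log_prior[c] + self.models[c].log_prob(doc))


# Toy data: both classes use 'a' and 'b' equally often, so only the
# ordering of characters (the chain dependence) separates them.
train_docs = ["abababab", "ababab", "bbaabbaa", "aabbaabb"]
train_labels = ["alternating", "alternating", "paired", "paired"]
clf = CANClassifier().fit(train_docs, train_labels)
print(clf.predict("abab"), clf.predict("aabb"))
```

In the toy data the two classes have identical unigram statistics, so a standard naive Bayes model over single characters would have nothing to distinguish them. The bigram chain is what carries the signal here. Setting `lam` to 0 reduces each class model to a smoothed unigram model, which corresponds to the standard naive Bayes case.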