Text classification improved through multigram models

Authors:
Dou Shen;Jian-Tao Sun;Qiang Yang;Zheng Chen
Affiliations:
Hong Kong University of Science and Technology;Microsoft Research Asia, Beijing, P. R. China;Hong Kong University of Science and Technology;Microsoft Research Asia, Beijing, P. R. China
Venue:
CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
Year:
2006

Abstract

Classification algorithms and document representation approaches are the two key elements of a successful document classification system. Much past work has sought better ways to represent documents. However, most of these attempts either rely on extra resources such as WordNet or suffer from extremely high dimensionality. In this paper, we propose a new document representation approach based on n-multigram language models. This approach automatically discovers the hidden semantic sequences in the documents of each category. Based on n-multigram language models and n-gram language models, we put forward two text classification algorithms. Experiments on RCV1 show that the first algorithm, based on n-multigram models alone, achieves similar or even better classification performance than a classifier based on n-gram models, while its model size is much smaller. The second algorithm, based on the combination of n-multigram models and n-gram models, improves micro-F1 from 89.5% to 92.6% and macro-F1 from 87.2% to 91.1%. These observations support the validity of the proposed document representation approach.
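
To make the idea of "hidden sequences" concrete, here is a minimal sketch in Python. It assumes the usual multigram formulation, in which a text is treated as a concatenation of variable-length token sequences of at most n tokens each, and the best segmentation is found by dynamic programming. The abstract does not describe how the authors estimate their models or build their classifiers, so this is not their method. The function names, the `unk_logprob` floor for unseen tokens, and the toy probabilities in `TOY_MODELS` are all hypothetical, chosen only for illustration.

```python
import math

def best_segmentation(tokens, seq_logprob, n, unk_logprob=-20.0):
    """Find the most likely split of `tokens` into chunks of at most `n` tokens.

    `seq_logprob` maps a tuple of tokens to its log-probability (assumed given).
    Returns (log-probability of the best split, list of chunks).
    """
    # best[i] = (score of the best segmentation of tokens[:i], start of its last chunk)
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(tokens)
    for end in range(1, len(tokens) + 1):
        for start in range(max(0, end - n), end):
            chunk = tuple(tokens[start:end])
            if chunk in seq_logprob:
                lp = seq_logprob[chunk]
            elif len(chunk) == 1:
                lp = unk_logprob  # hypothetical floor score for unseen single tokens
            else:
                continue  # unseen multi-token chunks are not allowed
            score = best[start][0] + lp
            if score > best[end][0]:
                best[end] = (score, start)
    # Follow the stored start indices backwards to recover the chunks.
    chunks, end = [], len(tokens)
    while end > 0:
        start = best[end][1]
        chunks.append(tuple(tokens[start:end]))
        end = start
    return best[len(tokens)][0], chunks[::-1]

def classify(tokens, models, n):
    """Pick the category whose model gives the best segmentation score."""
    return max(models, key=lambda c: best_segmentation(tokens, models[c], n)[0])

# Toy per-category models with made-up probabilities (illustration only).
TOY_MODELS = {
    "finance": {
        ("stock",): math.log(0.2),
        ("market",): math.log(0.2),
        ("stock", "market"): math.log(0.3),
        ("rises",): math.log(0.1),
    },
    "sports": {
        ("stock",): math.log(0.01),
        ("market",): math.log(0.01),
        ("rises",): math.log(0.05),
        ("home", "run"): math.log(0.3),
    },
}

if __name__ == "__main__":
    doc = ["stock", "market", "rises"]
    print(best_segmentation(doc, TOY_MODELS["finance"], n=2))
    print(classify(doc, TOY_MODELS, n=2))
```

With these toy values, the finance model prefers to keep "stock market" together as one sequence rather than splitting it into two words, which is the kind of variable-length unit a multigram model can pick up without an external resource. A real system would have to learn the sequence probabilities from training documents of each category instead of fixing them by hand.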