Classification algorithms and document representation approaches are the two key elements of a successful document classification system. Much previous work has sought better ways to represent documents. However, most of these attempts either rely on extra resources such as WordNet or suffer from extremely high dimensionality. In this paper, we propose a new document representation approach based on n-multigram language models. This approach automatically discovers the hidden semantic sequences in the documents of each category. Based on n-multigram and n-gram language models, we put forward two text classification algorithms. Experiments on RCV1 show that the first algorithm, based on n-multigram models alone, achieves similar or even better classification performance than a classifier based on n-gram models, while its model size is much smaller. The second algorithm, which combines n-multigram and n-gram models, improves micro-F1 from 89.5% to 92.6% and macro-F1 from 87.2% to 91.1%. These results support the validity of the proposed document representation approach.
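To make the n-gram baseline mentioned in the abstract more concrete, the following Python sketch trains one bigram language model per category and assigns a document to the category whose model gives it the highest log-probability. This is only an illustration under stated assumptions, not the algorithm evaluated in the paper: the bigram order, add-one smoothing, uniform class priors, and the names `BigramLMClassifier` and `bigrams` are all hypothetical choices made here, and the n-multigram model itself is not implemented.

```python
import math
from collections import Counter, defaultdict


def bigrams(tokens):
    """Return the bigrams of a token list, padded with sentence boundary markers."""
    padded = ["<s>"] + list(tokens) + ["</s>"]
    return list(zip(padded, padded[1:]))


class BigramLMClassifier:
    """One add-one-smoothed bigram model per category (illustrative assumption)."""

    def __init__(self):
        # category -> counts of (previous word, word) pairs
        self.bigram_counts = defaultdict(Counter)
        # category -> counts of the previous word alone
        self.context_counts = defaultdict(Counter)
        # shared vocabulary, including the end-of-document marker
        self.vocab = {"</s>"}

    def fit(self, docs, labels):
        for tokens, label in zip(docs, labels):
            self.vocab.update(tokens)
            for prev, word in bigrams(tokens):
                self.bigram_counts[label][(prev, word)] += 1
                self.context_counts[label][prev] += 1
        return self

    def log_prob(self, tokens, label):
        """Log-probability of a document under the model of one category."""
        vocab_size = len(self.vocab)
        total = 0.0
        for prev, word in bigrams(tokens):
            # Add-one smoothing keeps unseen bigrams from getting zero probability.
            numerator = self.bigram_counts[label][(prev, word)] + 1
            denominator = self.context_counts[label][prev] + vocab_size
            total += math.log(numerator / denominator)
        return total

    def predict(self, tokens):
        # Uniform class priors are assumed, so only the likelihoods are compared.
        return max(self.bigram_counts, key=lambda c: self.log_prob(tokens, c))
```

A classifier of this kind stores one count per observed n-gram per category, which is why model size grows quickly with n; the abstract's claim is that n-multigram models reach comparable accuracy with a much smaller model.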