Text Classification Improved through Automatically Extracted Sequences

Authors:
Dou Shen;Jian-Tao Sun;Qiang Yang;Hui Zhao;Zheng Chen
Affiliations:
Hong Kong University of Science and Technology;Microsoft Research Asia;Hong Kong University of Science and Technology;Hong Kong University of Science and Technology;Microsoft Research Asia
Venue:
ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Year:
2006

Citing 0
Cited 2

Sampling the Web as Training Data for Text Classification
International Journal of Digital Library Systems

A pattern based two-stage text classifier
MLDM'13 Proceedings of the 9th international conference on Machine Learning and Data Mining in Pattern Recognition

Abstract

We propose using the n-multigram model to help automatic text classification. This model can automatically discover the latent semantic sequences contained in the document set of each category. Based on the n-multigram model and the n-gram language model, we put forward two text classification algorithms. Experiments on RCV1 show that our proposed algorithm based on the n-multigram model achieves classification performance similar to that of the one based on the n-gram model, while its model size is only 4.21% of the latter's. The other proposed algorithm, based on a combination of the n-multigram model and the n-gram model, improves the micro-F1 and macro-F1 values by 3.5% and 4.5% respectively, which supports the validity of our approach.
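
The abstract does not spell out either algorithm, so the following is only a rough sketch of the general idea behind an n-gram language model classifier: train one language model per category and assign a document to the category whose model gives it the highest likelihood. It is not the authors' method and does not implement the n-multigram model. The choice of bigrams, add-one smoothing, a uniform category prior, the toy data, and all names (`BigramCategoryModel`, `train`, `classify`) are assumptions made for illustration.

```python
import math
from collections import Counter


def bigrams(tokens):
    """Return adjacent token pairs, padded with boundary markers."""
    padded = ["<s>"] + list(tokens) + ["</s>"]
    return list(zip(padded, padded[1:]))


class BigramCategoryModel:
    """Add-one smoothed bigram language model for a single category (illustrative)."""

    def __init__(self, documents):
        self.bigram_counts = Counter()
        self.context_counts = Counter()
        self.vocab = {"</s>"}
        for tokens in documents:
            self.vocab.update(tokens)
            for prev, word in bigrams(tokens):
                self.bigram_counts[(prev, word)] += 1
                self.context_counts[prev] += 1

    def log_prob(self, tokens, vocab_size):
        """Log-likelihood of a token sequence under this category's model."""
        total = 0.0
        for prev, word in bigrams(tokens):
            numerator = self.bigram_counts[(prev, word)] + 1
            denominator = self.context_counts[prev] + vocab_size
            total += math.log(numerator / denominator)
        return total


def train(labeled_docs):
    """Build one model per category from (tokens, category) pairs."""
    docs_by_category = {}
    for tokens, category in labeled_docs:
        docs_by_category.setdefault(category, []).append(tokens)
    models = {c: BigramCategoryModel(d) for c, d in docs_by_category.items()}
    shared_vocab = set().union(*(m.vocab for m in models.values()))
    # One extra slot is reserved for words never seen in training.
    return models, len(shared_vocab) + 1


def classify(models, vocab_size, tokens):
    """Pick the category with the highest likelihood (uniform prior assumed)."""
    return max(models, key=lambda c: models[c].log_prob(tokens, vocab_size))


# Toy data, invented for this example only.
training = [
    ("stocks rose as markets rallied".split(), "finance"),
    ("the bank raised interest rates".split(), "finance"),
    ("the team won the final match".split(), "sports"),
    ("the striker scored two goals".split(), "sports"),
]
models, vocab_size = train(training)
sports_guess = classify(models, vocab_size, "the striker won the final match".split())
finance_guess = classify(models, vocab_size, "the bank raised rates".split())
```

In a sketch like this the model stores a count for every observed n-gram, which is the kind of model size the abstract compares against when it reports the n-multigram model at 4.21% of the n-gram model.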