Urdu text classification

Authors:
Abbas Raza Ali;Maliha Ijaz
Affiliations:
National University of Computers and Emerging Sciences, Lahore, Pakistan;National University of Computers and Emerging Sciences, Lahore, Pakistan
Venue:
Proceedings of the 7th International Conference on Frontiers of Information Technology
Year:
2009

Abstract

This paper compares two statistical techniques for text classification, Naïve Bayes and Support Vector Machines, in the context of the Urdu language. A large corpus is used to train and test the classifiers. Because the classifiers cannot interpret the raw dataset directly, language-specific preprocessing techniques are applied to it to generate a standardized, reduced-feature lexicon. Urdu is a morphologically rich language, which makes these tasks complex. Statistical characteristics of the corpus and lexicon are measured, and they indicate that the text preprocessing module performs satisfactorily. The empirical results show that Support Vector Machines outperform the Naïve Bayes classifier in terms of classification accuracy.
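
As a rough illustration of one of the two classifiers compared above, the following is a minimal sketch of a multinomial Naïve Bayes text classifier with add-one smoothing. It is not the authors' implementation: the function names (`tokenize`, `train_nb`, `predict_nb`), the whitespace tokenizer standing in for the language-specific preprocessing, and the toy English documents are all assumptions made for this example. A Support Vector Machine counterpart is omitted because it would normally rely on an external library.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Hypothetical stand-in for language-specific preprocessing:
    # plain whitespace splitting, with no normalization or stemming.
    return text.split()


def train_nb(docs):
    """Collect per-class document counts and token counts.

    docs is a list of (text, label) pairs.
    """
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        class_docs[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_docs, word_counts, vocab


def predict_nb(model, text):
    """Return the label with the highest log-posterior score."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, float("-inf")
    for label in class_docs:
        # Log prior: fraction of training documents in this class.
        score = math.log(class_docs[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # ignore tokens never seen in training
            # Add-one (Laplace) smoothed log likelihood.
            count = word_counts[label][tok] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data invented for this sketch; not taken from the paper's corpus.
TOY_DOCS = [
    ("cricket match team", "sports"),
    ("team won match", "sports"),
    ("bank market price", "business"),
    ("market price rise", "business"),
]

model = train_nb(TOY_DOCS)
print(predict_nb(model, "cricket team"))
print(predict_nb(model, "market price"))
```

In a setting like the one the abstract describes, `tokenize` would be replaced by the preprocessing step that produces the reduced-feature lexicon, and accuracy would be measured on held-out documents rather than on hand-picked strings.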