Boosting KNN text classification accuracy by using supervised term weighting schemes

Authors:
Iyad Batal;Milos Hauskrecht
Affiliations:
University of Pittsburgh, Pittsburgh, PA, USA;University of Pittsburgh, Pittsburgh, PA, USA
Venue:
Proceedings of the 18th ACM conference on Information and knowledge management
Year:
2009

Citing 15
Cited 1

The nature of statistical learning theory

The nature of statistical learning theory
A re-examination of text categorization methods

Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
BoosTexter: A Boosting-based Systemfor Text Categorization

Machine Learning - Special issue on information retrieval
An Evaluation of Statistical Approaches to Text Categorization

Information Retrieval
A vector space model for automatic indexing

Communications of the ACM
Machine learning in automated text categorization

ACM Computing Surveys (CSUR)
Introduction to Modern Information Retrieval

Introduction to Modern Information Retrieval
Text Categorization with Suport Vector Machines: Learning with Many Relevant Features

ECML '98 Proceedings of the 10th European Conference on Machine Learning
A Comparative Study on Feature Selection in Text Categorization

ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning
Latent dirichlet allocation

The Journal of Machine Learning Research
An extensive empirical study of feature selection metrics for text classification

The Journal of Machine Learning Research
Supervised term weighting for automated text categorization

Proceedings of the 2003 ACM symposium on Applied computing
Feature selection using linear classifier weights: interaction with classification models

Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Improving Text Classification using Local Latent Semantic Indexing

ICDM '04 Proceedings of the Fourth IEEE International Conference on Data Mining
BNS feature scaling: an improved representation over tf-idf for svm text classification

Proceedings of the 17th ACM conference on Information and knowledge management

Projected-prototype based classifier for text categorization

Knowledge-Based Systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

The increasing availability of digital documents in the last decade has prompted the development of machine learning techniques to automatically classify and organize text documents. The majority of text classification systems rely on the vector space model, which represents the documents as vectors in the term space. Each vector component is assigned a weight that reflects the importance of the term in the document. Typically, these weights are assigned using an information retrieval (IR) approach, such as the famous tf-idf function. In this work, we study two weighting schemes based on information gain and chi-square statistics. These schemes take advantage of the category label information to weight the terms according to their distributions across the different categories. We show that using these supervised weights instead of conventional unsupervised weights can greatly improve the performance of the k-nearest neighbor (KNN) classifier. Experimental evaluations, carried out on multiple text classification tasks, demonstrate the benefits of this approach in creating accurate text classifiers.