BNS feature scaling: an improved representation over TF-IDF for SVM text classification

Authors:
George Forman
Affiliations:
Hewlett-Packard Labs, Palo Alto, CA, USA
Venue:
Proceedings of the 17th ACM conference on Information and knowledge management
Year:
2008

Abstract

In the realm of machine learning for text classification, TF-IDF is the most widely used representation for real-valued feature vectors. However, IDF is oblivious to the training class labels and naturally scales some features inappropriately. We replace IDF with Bi-Normal Separation (BNS), which has been previously found to be excellent at ranking words for feature selection filtering. Empirical evaluation on a benchmark of 237 binary text classification tasks shows substantially better accuracy and F-measure for a Support Vector Machine (SVM) by using BNS scaling. A wide variety of other feature representations, including binary features with no scaling, were later tested and found inferior. Moreover, BNS scaling yielded better performance without feature selection, obviating the need for it.
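
The abstract does not spell out the BNS formula. As a rough illustration, the sketch below uses the definition commonly given in the feature-selection literature: the absolute difference between the inverse standard normal CDF of a term's true-positive rate and that of its false-positive rate. The function names (`bns_weights`, `bns_scale`), the clipping constant `eps`, and the choice to multiply raw term counts by the BNS weight are assumptions made for this example, not details confirmed by the abstract.

```python
from collections import Counter
from statistics import NormalDist

# Inverse CDF of the standard normal distribution (Python 3.8+).
_INV_CDF = NormalDist().inv_cdf


def bns_weights(docs, labels, eps=0.0005):
    """Compute one BNS weight per term for a binary task.

    docs:   list of dicts mapping term -> count (hypothetical input format)
    labels: list of bools, True for the positive class
    eps:    assumed clipping bound that keeps rates away from 0 and 1,
            where the inverse normal CDF is undefined
    """
    pos = sum(1 for y in labels if y)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative document")

    tp = Counter()  # number of positive documents containing each term
    fp = Counter()  # number of negative documents containing each term
    for doc, y in zip(docs, labels):
        for term in doc:
            (tp if y else fp)[term] += 1

    weights = {}
    for term in set(tp) | set(fp):
        tpr = min(max(tp[term] / pos, eps), 1 - eps)
        fpr = min(max(fp[term] / neg, eps), 1 - eps)
        weights[term] = abs(_INV_CDF(tpr) - _INV_CDF(fpr))
    return weights


def bns_scale(doc, weights):
    """Scale a document's term counts by BNS weights (used in place of IDF)."""
    return {term: count * weights.get(term, 0.0) for term, count in doc.items()}


docs = [
    {"cheap": 2, "the": 1},
    {"cheap": 1, "the": 3},
    {"the": 1, "meeting": 1},
    {"the": 2},
]
labels = [True, True, False, False]
weights = bns_weights(docs, labels)
scaled = bns_scale(docs[0], weights)
```

In this toy data, "the" occurs equally often in both classes, so its weight is zero and it drops out of the scaled vector, whereas IDF alone has no access to the labels and cannot make that distinction. The weights should be computed from training documents only and then applied unchanged to test documents.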