A new feature selection algorithm based on binomial hypothesis testing for spam filtering

Authors:
Jieming Yang;Yuanning Liu;Zhen Liu;Xiaodong Zhu;Xiaoxu Zhang
Affiliations:
College of Computer Science and Technology, Jilin University, Changchun, Jilin, China and College of Information Engineering, Northeast Dianli University, Jilin, China;College of Computer Science and Technology, Jilin University, Changchun, Jilin, China;College of Computer Science and Technology, Jilin University, Changchun, Jilin, China and Graduate School of Engineering, Nagasaki Institute of Applied Science, Nagasaki-shi, Nagasaki, Japan;College of Computer Science and Technology, Jilin University, Changchun, Jilin, China;College of Computer Science and Technology, Jilin University, Changchun, Jilin, China
Venue:
Knowledge-Based Systems
Year:
2011

Abstract

Content-based spam filtering is a binary text categorization problem. Feature selection, an important and indispensable step in text categorization, therefore also plays an important role in improving spam filtering performance. We propose a new method, named Bi-Test, which uses binomial hypothesis testing to estimate whether the probability that a feature belongs to spam satisfies a given threshold. We evaluated Bi-Test on six benchmark spam corpora (pu1, pu2, pu3, pua, lingspam and CSDMC2010) using two classification algorithms, Naive Bayes (NB) and Support Vector Machines (SVM), and compared it with four well-known feature selection algorithms (information gain, χ²-statistic, improved Gini index and Poisson distribution). The experiments show that, in terms of the F1 measure, Bi-Test performs significantly better than the χ²-statistic and Poisson distribution and is comparable to information gain and the improved Gini index when the Naive Bayes classifier is used; it achieves performance comparable to the other methods when the SVM classifier is used. Moreover, Bi-Test executes faster than the other four algorithms.
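
The abstract does not give the exact Bi-Test statistic, so the following is only a minimal sketch of the general idea of selecting features with a binomial hypothesis test. It assumes a one-sided exact test of the null hypothesis that a term's spam probability equals a threshold `p0`; the function names, the per-term document counts, and the values of `p0` and `alpha` are all illustrative assumptions, not taken from the paper.

```python
from math import comb


def binomial_upper_tail(k, n, p0):
    """Return P(X >= k) for X ~ Binomial(n, p0) (exact one-sided p-value)."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))


def select_features(doc_counts, p0=0.5, alpha=0.05):
    """Keep terms whose spam proportion is significantly above p0.

    doc_counts maps term -> (spam_docs_with_term, total_docs_with_term).
    p0 and alpha are assumed values chosen for illustration only.
    Returns (term, p_value) pairs sorted from most to least significant.
    """
    selected = []
    for term, (k, n) in doc_counts.items():
        if n == 0:
            continue  # a term that never occurs carries no evidence
        p_value = binomial_upper_tail(k, n, p0)
        if p_value < alpha:
            selected.append((term, p_value))
    return sorted(selected, key=lambda item: item[1])


# Hypothetical counts: (documents containing the term that are spam, all documents containing it)
counts = {"viagra": (10, 10), "meeting": (1, 10), "hello": (5, 10)}
selected_terms = select_features(counts)
```

With these hypothetical counts, only the term that appears exclusively in spam is kept. A one-sided test like this ignores terms that strongly indicate legitimate mail; testing the lower tail as well would be one way to capture them, which is again an assumption rather than something stated in the abstract.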