Content-based spam filtering is a binary text categorization problem. Feature selection, an indispensable step in text categorization, therefore also plays an important role in improving spam-filtering performance. We propose a new feature selection method, named Bi-Test, which uses binomial hypothesis testing to estimate whether the probability that a feature belongs to spam satisfies a given threshold. We evaluated Bi-Test on six benchmark spam corpora (pu1, pu2, pu3, pua, lingspam and CSDMC2010) with two classification algorithms, Naive Bayes (NB) and Support Vector Machines (SVM), and compared it with four well-known feature selection algorithms (information gain, χ²-statistic, improved Gini index and Poisson distribution). The experiments show that, when the Naive Bayes classifier is used, Bi-Test performs significantly better than the χ²-statistic and Poisson distribution and is comparable to information gain and improved Gini index in terms of F1 measure; when the SVM classifier is used, it is comparable to the other methods. Moreover, Bi-Test executes faster than the other four algorithms.
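The idea of scoring a feature with a binomial hypothesis test can be illustrated with a minimal sketch. The abstract does not give the exact Bi-Test statistic, so everything below is an assumption made for illustration: the function names, the use of per-class document frequencies as input, the threshold `p0`, and the choice to take the smaller of the two tail probabilities as the score. It is not the authors' implementation.

```python
from math import comb


def binomial_upper_tail(k, n, p0):
    """Return P(X >= k) for X ~ Binomial(n, p0)."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))


def binomial_test_score(spam_df, ham_df, p0=0.5):
    """Hypothetical score: how surprising the spam/ham split of a feature is
    under the null hypothesis P(spam | feature) = p0. Smaller means the
    feature is more strongly tied to one class.

    spam_df, ham_df: number of spam / legitimate messages containing the feature.
    """
    n = spam_df + ham_df
    if n == 0:
        return 1.0  # an unseen feature carries no evidence
    # Evidence that the feature leans towards spam: P(X >= spam_df).
    upper = binomial_upper_tail(spam_df, n, p0)
    # Evidence that it leans towards legitimate mail: P(X <= spam_df),
    # computed as the upper tail of the ham count under probability 1 - p0.
    lower = binomial_upper_tail(ham_df, n, 1 - p0)
    return min(upper, lower)


def select_features(doc_freqs, k, p0=0.5):
    """Return the k features with the smallest score.

    doc_freqs maps feature -> (spam_df, ham_df).
    """
    ranked = sorted(doc_freqs, key=lambda f: binomial_test_score(*doc_freqs[f], p0))
    return ranked[:k]


if __name__ == "__main__":
    # Toy counts, invented purely for illustration.
    counts = {"viagra": (40, 1), "meeting": (2, 35), "the": (50, 50)}
    print(select_features(counts, 2))
```

With the toy counts above, a term that appears almost only in spam and one that appears almost only in legitimate mail rank ahead of a term that is evenly split. The score needs only two counts per feature and one pass over the vocabulary.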