Best terms: an efficient feature-selection algorithm for text categorization

Authors:
Dimitris Fragoudis;Dimitris Meretakis;Spiridon Likothanassis_aff1n3
Affiliations:
Computer Engineering and Informatics Department, University of Patras, GR-26500, Rio—Patras, Greece and aff3 Computer Technology Institute, Patras, Greece;Novartis Pharma, Griffith University, GR-26500, Basel, Switzerland;Computer Engineering and Informatics Department, University of Patras, GR-26500, Rio—Patras, Greece and aff3 Computer Technology Institute, Patras, Greece
Venue:
Knowledge and Information Systems
Year:
2005

Citing 17
Cited 6

C4.5: programs for machine learning

C4.5: programs for machine learning
The nature of statistical learning theory

The nature of statistical learning theory
Feature selection, perceptron learning, and a usability case study for text categorization

Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval
Selection of relevant features and examples in machine learning

Artificial Intelligence - Special issue on relevance
Inductive learning algorithms and representations for text categorization

Proceedings of the seventh international conference on Information and knowledge management
Distributional clustering of words for text classification

Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
A re-examination of text categorization methods

Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Hierarchical neural networks for text categorization (poster abstract)

Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
An Evaluation of Statistical Approaches to Text Categorization

Information Retrieval
On feature distributional clustering for text categorization

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Machine learning in automated text categorization

ACM Computing Surveys (CSUR)
Information Retrieval

Information Retrieval
High-performing feature selection for text classification

Proceedings of the eleventh international conference on Information and knowledge management
Text Categorization Based on Regularized Linear Classification Methods

Information Retrieval
Text Categorization with Suport Vector Machines: Learning with Many Relevant Features

ECML '98 Proceedings of the 10th European Conference on Machine Learning
A Comparative Study on Feature Selection in Text Categorization

ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning
Feature selection and feature extraction for text categorization

HLT '91 Proceedings of the workshop on Speech and Natural Language

Document-Base Extraction for Single-Label Text Classification

DaWaK '08 Proceedings of the 10th international conference on Data Warehousing and Knowledge Discovery
A Hybrid Statistical Data Pre-processing Approach for Language-Independent Text Classification

ADMA '09 Proceedings of the 5th International Conference on Advanced Data Mining and Applications
Hybrid DIAAF/RS: statistical textual feature selection for language-independent text classification

ICDM'10 Proceedings of the 10th industrial conference on Advances in data mining: applications and theoretical aspects
A new feature selection algorithm based on binomial hypothesis testing for spam filtering

Knowledge-Based Systems
A new feature selection based on comprehensive measurement both in inter-category and intra-category for text categorization

Information Processing and Management: an International Journal
Fuzzy unordered rule induction algorithm in text categorization on top of geometric particle swarm optimization term selection

Knowledge-Based Systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

In this paper, we propose a new feature-selection algorithm for text classification, called best terms (BT). The complexity of BT is linear in respect to the number of the training-set documents and is independent from both the vocabulary size and the number of categories. We evaluate BT on two benchmark document collections, Reuters-21578 and 20-Newsgroups, using two classification algorithms, naive Bayes (NB) and support vector machines (SVM). Our experimental results, comparing BT with an extensive and representative list of feature-selection algorithms, show that (1) BT is faster than the existing feature-selection algorithms; (2) BT leads to a considerable increase in the classification accuracy of NB and SVM as measured by the F1 measure; (3) BT leads to a considerable improvement in the speed of NB and SVM; in most cases, the training time of SVM has dropped by an order of magnitude; (4) in most cases, the combination of BT with the simple, but very fast, NB algorithm leads to classification accuracy comparable with SVM while sometimes it is even more accurate.