An enhanced ACO algorithm to select features for text categorization and its parallelization

  • Authors:
  • M. Janaki Meena;K. R. Chandran;A. Karthik;A. Vijay Samuel

  • Affiliations:
  • Department of CSE, PSG College of Technology, Coimbatore, Tamil Nadu 641004, India;Department of IT, PSG College of Technology, Coimbatore, Tamil Nadu 641004, India;Department of CSE, PSG College of Technology, Coimbatore, Tamil Nadu 641004, India;Department of CSE, PSG College of Technology, Coimbatore, Tamil Nadu 641004, India

  • Venue:
  • Expert Systems with Applications: An International Journal
  • Year:
  • 2012

Quantified Score

Hi-index 12.05

Abstract

Feature selection is an indispensable preprocessing step for effective analysis of high-dimensional data. It removes irrelevant features, improves predictive accuracy, and makes the models built by feature-sensitive classifiers easier to interpret. Finding an optimal feature subset in a very large feature space is intractable, and many such feature selection problems have been shown to be NP-hard. Optimization algorithms are therefore often applied to NP-hard problems to find near-optimal solutions in practical time. This paper formulates text feature selection as a combinatorial problem and proposes an Ant Colony Optimization (ACO) algorithm to find a near-optimal solution to it. The algorithm differs from the earlier one by Aghdam et al. in adding a statistics-based heuristic function and a local search. It seeks a solution that includes n distinct features for each category. Optimization algorithms based on wrapper models give better results, but they are time-intensive. The availability of parallel architectures, such as clusters of machines connected through Fast Ethernet, has increased interest in parallelizing algorithms. The proposed ACO algorithm was parallelized and demonstrated on a cluster of up to six machines. Documents from the 20 Newsgroups benchmark dataset were used for experimentation. Features selected by the proposed algorithm were evaluated using a Naive Bayes classifier and compared with those chosen by standard feature selection techniques. The classifier performed better with the features selected by the enhanced ACO with local search. The classifier's error decreased over iterations, and the number of positive features increased with the number of iterations.
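
The general idea of ACO-based feature selection can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not give the update rules, so the pheromone update, the parameters (`alpha`, `beta`, `rho`), the deposit amount, and the toy `heuristic` and `fitness` values below are all assumptions chosen for illustration. The local search and the parallelization are omitted. In a wrapper setting, `fitness` would be replaced by the accuracy of a classifier trained on the selected features.

```python
import random

def aco_select(heuristic, fitness, n, ants=10, iterations=20,
               alpha=1.0, beta=1.0, rho=0.1, seed=0):
    """Select n distinct feature indices with a basic ant colony scheme.

    heuristic: positive per-feature scores (stand-in for a statistical measure).
    fitness:   callable mapping a list of feature indices to a score to maximize.
    All parameter names and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    m = len(heuristic)
    pheromone = [1.0] * m
    best_subset, best_score = None, float("-inf")

    for _ in range(iterations):
        for _ in range(ants):
            candidates = list(range(m))
            subset = []
            # Each ant builds a subset one feature at a time, without repeats.
            for _ in range(n):
                weights = [pheromone[i] ** alpha * heuristic[i] ** beta
                           for i in candidates]
                pick = rng.choices(candidates, weights=weights, k=1)[0]
                candidates.remove(pick)
                subset.append(pick)
            score = fitness(subset)
            if score > best_score:
                best_subset, best_score = sorted(subset), score

        # Evaporate pheromone everywhere, then reinforce the best subset so far.
        pheromone = [(1.0 - rho) * p for p in pheromone]
        for i in best_subset:
            pheromone[i] += 1.0

    return best_subset, best_score

# Toy problem (hypothetical data): features 0, 1 and 4 are the relevant ones.
relevance = [1, 1, 0, 0, 1, 0]
heuristic = [0.9, 0.8, 0.1, 0.1, 0.7, 0.1]
best, score = aco_select(heuristic, lambda s: sum(relevance[i] for i in s), n=3)
```

Because each ant builds its subset independently within an iteration, the inner loop over ants is the natural unit of work to distribute across machines, with the pheromone update as the synchronization point.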