A parallel ACO algorithm to select terms to categorise longer documents

  • Authors:
  • M. Janaki Meena;K. R. Chandran;A. Karthik;A. Vijay Samuel

  • Affiliations:
  • Department of CSE, PSG College of Technology, Coimbatore - 641004, Tamilnadu, India.;Department of IT, PSG College of Technology, Coimbatore - 641004, Tamilnadu, India.;Department of CSE, PSG College of Technology, Coimbatore - 641004, Tamilnadu, India.;Department of CSE, PSG College of Technology, Coimbatore - 641004, Tamilnadu, India

  • Venue:
  • International Journal of Computational Science and Engineering
  • Year:
  • 2011

Abstract

Text categorisation (TC) is the task of assigning predefined categories to text. The primary step in TC is to transform documents into a representation suitable for machine learning algorithms; bag of words is the most popular such representation. Most machine learning algorithms are sensitive to the features fed into them and are misled by the high dimensionality of text. Feature selection (FS) is an important preprocessing step that removes redundant and irrelevant terms from the training corpus. This paper proposes an ant colony optimization (ACO) algorithm to select features for categorising longer documents whose categories are closely related. The heuristic value for each word is computed from the statistical dependency of the term on a category and its compactness value. The compactness of a term indicates its spread in a document. Experiments were conducted with documents from the 20 Newsgroups and Reuters-21578 benchmarks. The selected features were fed into a naïve Bayes classifier and its performance was analysed. It was observed that the performance of the classifier improves with the features selected by the proposed method. The processes involved in the algorithm are time-intensive and demand parallelism. Hence the ACO algorithm was parallelised using the MapReduce programming model. The parallel algorithm was implemented and tested on a cluster of six machines formed using Hadoop.
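
The two ingredients of the heuristic can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so everything here is an assumption made for illustration: the chi-square statistic stands in for "statistical dependency", `spread` (first-to-last occurrence distance divided by document length) stands in for compactness, and the helper names `map_doc`, `reduce_counts` and `chi_square` are hypothetical. The counting is written in a map/reduce style to mirror the MapReduce decomposition, but it runs in a single process rather than on Hadoop, and the ACO search itself is not shown.

```python
from collections import Counter
from functools import reduce


def spread(tokens, term):
    """Assumed compactness proxy: distance between the first and last
    occurrence of `term`, as a fraction of the document length."""
    positions = [i for i, tok in enumerate(tokens) if tok == term]
    if not positions or len(tokens) < 2:
        return 0.0
    return (positions[-1] - positions[0]) / (len(tokens) - 1)


def map_doc(doc):
    """Map step: count each distinct term of one document once,
    keyed by (term, category)."""
    tokens, category = doc
    return Counter((term, category) for term in set(tokens))


def reduce_counts(left, right):
    """Reduce step: sum partial document-frequency counts."""
    return left + right


def chi_square(term, category, counts, docs_per_category):
    """Assumed dependency measure: chi-square of term vs. category,
    computed from the 2x2 contingency table of document counts."""
    n = sum(docs_per_category.values())
    a = counts[(term, category)]  # in category, contains term
    b = sum(counts[(term, c)] for c in docs_per_category if c != category)
    c = docs_per_category[category] - a  # in category, lacks term
    d = n - a - b - c  # outside category, lacks term
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


# Toy corpus (made up for this example): (tokens, category) pairs.
docs = [
    ("ant colony ant".split(), "sci"),
    ("colony market".split(), "biz"),
]
counts = reduce(reduce_counts, map(map_doc, docs), Counter())
docs_per_category = Counter(category for _, category in docs)

ant_dependency = chi_square("ant", "sci", counts, docs_per_category)
ant_spread = spread(docs[0][0], "ant")
```

How the two values are combined into a single heuristic is left open here, since the abstract does not specify it.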