GAMoN: Discovering M-of-N{¬,∨} hypotheses for text classification by a lattice-based Genetic Algorithm

  • Authors:
  • Veronica L. Policicchio;Adriana Pietramala;Pasquale Rullo

  • Affiliations:
  • Dept. of Mathematics, University of Calabria, Italy;Dept. of Mathematics, University of Calabria, Italy;Dept. of Mathematics, University of Calabria, Italy

  • Venue:
  • Artificial Intelligence
  • Year:
  • 2012

Abstract

While rule-based text classifiers have a long history, to the best of our knowledge no M-of-N-based approach to text categorization has so far been proposed. In this paper we argue that M-of-N hypotheses are particularly well suited to modeling the text classification task because of the so-called "family resemblance" metaphor: "the members (i.e., documents) of a family (i.e., category) share some small number of features, yet there is no common feature among all of them. Nevertheless, they resemble each other". Starting from this conjecture, we provide a sound extension of the M-of-N approach with negation and disjunction, called M-of-N{¬,∨}, which makes it possible to best fit the true structure of the data. Based on a thorough theoretical study, we show that the M-of-N{¬,∨} hypothesis space has two partial orders that form complete lattices. GAMoN is the task-specific Genetic Algorithm (GA) which, by exploiting the lattice-based structure of the hypothesis space, efficiently induces accurate M-of-N{¬,∨} hypotheses. GAMoN was benchmarked on 13 real-world text data sets against four rule induction algorithms: two GAs, namely BioHEL and OlexGA, and two non-evolutionary algorithms, namely C4.5 and Ripper. We also included linear SVM in the study, as it is reported to be among the best methods for text categorization. Experimental results demonstrate that GAMoN delivers state-of-the-art classification performance, providing a good balance between accuracy and model complexity. They further show that GAMoN scales up to large, real-world domains better than both C4.5 and Ripper.
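The abstract does not spell out the semantics of M-of-N{¬,∨} hypotheses, so the following is only an illustrative sketch of one plausible reading, not the authors' definition. It assumes that a single rule fires when at least M of its N positive terms occur in a document and none of its negative terms occur (the ¬ part), and that a category hypothesis is a disjunction of such rules (the ∨ part). The names `MOfNRule`, `positive`, `negative`, `threshold` and `classify`, as well as the toy vocabulary, are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Sequence


@dataclass(frozen=True)
class MOfNRule:
    """Hypothetical M-of-N rule with negation (assumed semantics)."""
    positive: FrozenSet[str]   # the N candidate terms
    negative: FrozenSet[str]   # terms whose presence blocks the rule
    threshold: int             # M: minimum number of positive terms required

    def matches(self, doc_terms: Iterable[str]) -> bool:
        terms = set(doc_terms)
        if terms & self.negative:          # negation: any negative term vetoes
            return False
        return len(terms & self.positive) >= self.threshold


def classify(rules: Sequence[MOfNRule], doc_terms: Iterable[str]) -> bool:
    """Disjunction: the document belongs to the category if any rule fires."""
    return any(rule.matches(doc_terms) for rule in rules)


# Toy "family resemblance" example: no term is shared by all positive
# documents, yet each contains at least 2 of the 4 positive terms.
football = MOfNRule(
    positive=frozenset({"goal", "match", "league", "coach"}),
    negative=frozenset({"election"}),
    threshold=2,
)
tennis = MOfNRule(
    positive=frozenset({"tennis", "serve", "set"}),
    negative=frozenset(),
    threshold=2,
)
sport = [football, tennis]

doc_a = ["goal", "match", "stadium"]
doc_b = ["league", "coach", "contract"]
doc_c = ["coach", "league", "election"]   # blocked by a negative term
doc_d = ["goal", "weather"]               # only 1 of 4 positive terms
doc_e = ["tennis", "serve", "rain"]       # fires the second disjunct

results = [classify(sport, d) for d in (doc_a, doc_b, doc_c, doc_d, doc_e)]
print(results)
```

In this sketch `doc_a` and `doc_b` have no term in common but are both accepted, which is the situation the family resemblance metaphor describes. How GAMoN actually searches for such hypotheses over the lattice structure is not covered here.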