Irrelevant attributes and imbalanced classes in multi-label text-categorization domains

  • Authors:
  • Sareewan Dendamrongvit;Peerapon Vateekul;Miroslav Kubat

  • Affiliations:
  • Department of Electrical & Computer Engineering, University of Miami, Coral Gables, FL, USA;Department of Electrical & Computer Engineering, University of Miami, Coral Gables, FL, USA;Department of Electrical & Computer Engineering, University of Miami, Coral Gables, FL, USA

  • Venue:
  • Intelligent Data Analysis
  • Year:
  • 2011

Abstract

An interesting issue in machine learning is induction in multi-label domains, where each example can be labeled with two or more classes at the same time. In work focusing on text categorization, we followed the most commonly used approach and induced a binary classifier for each class. Analyzing the results, we noticed that performance had been impaired by two factors. First, in text domains, each class is characterized by a different set of attributes; an appropriate attribute-selection technique thus has to be applied separately to each class. Second, the individual classes often have to be induced from imbalanced training sets, a circumstance we addressed here by majority-class undersampling. The paper provides details of the induction system and reports the results of systematic experimentation.
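
The pipeline described in the abstract (one binary problem per class, majority-class undersampling, and attribute selection carried out separately for each class) can be illustrated with a minimal Python sketch. It is not the authors' system. The abstract does not say how attributes are scored, in what order the two steps run, or what class ratio is targeted, so everything below is an assumption made for illustration: the function names, the `ratio` and `k` parameters, the document-frequency-difference score, the choice to undersample before selecting attributes, and the toy documents.

```python
import random
from collections import Counter


def binary_targets(label_sets, cls):
    """One binary problem per class: 1 if the example carries `cls`, else 0."""
    return [1 if cls in labels else 0 for labels in label_sets]


def undersample_majority(docs, targets, ratio=1.0, seed=0):
    """Keep every minority example and a random subset of the majority ones.

    `ratio` is a hypothetical parameter: the desired majority/minority size ratio.
    """
    rng = random.Random(seed)
    pos = [i for i, t in enumerate(targets) if t == 1]
    neg = [i for i, t in enumerate(targets) if t == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    keep_n = min(len(majority), max(1, int(round(ratio * len(minority)))))
    kept = sorted(minority + rng.sample(majority, keep_n))
    return [docs[i] for i in kept], [targets[i] for i in kept]


def select_attributes(docs, targets, k):
    """Rank terms by the absolute difference between their document-frequency
    rates in positive and negative examples (an assumed, simple score)."""
    n_pos = sum(targets) or 1
    n_neg = (len(targets) - sum(targets)) or 1
    pos_df, neg_df = Counter(), Counter()
    for doc, t in zip(docs, targets):
        (pos_df if t else neg_df).update(set(doc))

    def score(term):
        return abs(pos_df[term] / n_pos - neg_df[term] / n_neg)

    vocab = set(pos_df) | set(neg_df)
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(vocab, key=lambda term: (-score(term), term))[:k]


def prepare_per_class(docs, label_sets, k=3, ratio=1.0, seed=0):
    """For each class, build a balanced binary training set and its own attribute set."""
    prepared = {}
    for cls in sorted(set().union(*label_sets)):
        targets = binary_targets(label_sets, cls)
        d, t = undersample_majority(docs, targets, ratio, seed)
        prepared[cls] = (select_attributes(d, t, k), d, t)
    return prepared


# Toy multi-label corpus (made up for illustration): tokenized documents and label sets.
docs = [
    ["goal", "match", "team"],
    ["vote", "election", "party"],
    ["team", "election", "vote"],
    ["market", "stock", "price"],
    ["stock", "price", "vote"],
    ["market", "bank", "price"],
    ["bank", "stock", "rate"],
    ["rate", "market", "bank"],
]
labels = [
    {"sport"}, {"politics"}, {"sport", "politics"}, {"finance"},
    {"finance", "politics"}, {"finance"}, {"finance"}, {"finance"},
]

prepared = prepare_per_class(docs, labels, k=3)
for cls, (attrs, d, t) in prepared.items():
    print(cls, attrs, f"{sum(t)} positive / {len(t) - sum(t)} negative")
```

In this sketch each class ends up with its own attribute list and its own balanced training set, and any binary learner could then be trained on each of them. Which learner, which attribute score, and which class ratio work best in practice are questions the paper itself addresses.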