Irrelevant attributes and imbalanced classes in multi-label text-categorization domains

Authors: Sareewan Dendamrongvit; Peerapon Vateekul; Miroslav Kubat
Affiliation (all authors): Department of Electrical & Computer Engineering, University of Miami, Coral Gables, FL, USA
Venue: Intelligent Data Analysis
Year: 2011

Abstract

An interesting issue in machine learning is induction in multi-label domains, where each example can be labeled with two or more classes at the same time. In work focusing on text categorization, we followed the most commonly used approach and induced a separate binary classifier for each class. Analyzing the results, we noticed that performance was impaired by two factors. First, in text domains, each class is characterized by a different set of attributes, so an appropriate attribute-selection technique has to be applied separately to each class. Second, the classifiers for the individual classes often have to be induced from imbalanced training sets, a circumstance we addressed by majority-class undersampling. The paper provides details of the induction system and reports the results of systematic experimentation.
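
The three ingredients named in the abstract (one binary problem per class, per-class attribute selection, and majority-class undersampling) can be illustrated with a minimal Python sketch. It is not the authors' system: the function names, the `ratio` parameter, and in particular the attribute score (the absolute difference in document frequency between positive and negative examples) are assumptions chosen for illustration, since the abstract does not specify the selection criterion or the sampling rate.

```python
import random
from typing import List, Sequence, Set


def binary_targets(label_sets: Sequence[Set[str]], cls: str) -> List[int]:
    """One-vs-rest targets for one class: 1 if the example carries `cls`, else 0."""
    return [1 if cls in labels else 0 for labels in label_sets]


def undersample_majority(targets: Sequence[int], ratio: float = 1.0,
                         seed: int = 0) -> List[int]:
    """Return sorted indices of a reduced training set.

    Every minority-class example is kept; majority-class examples are sampled
    at random so that roughly `ratio` of them remain per minority example.
    `ratio` is a hypothetical knob, not a value taken from the paper.
    """
    rng = random.Random(seed)
    pos = [i for i, t in enumerate(targets) if t == 1]
    neg = [i for i, t in enumerate(targets) if t == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    keep = min(len(majority), int(round(ratio * len(minority))))
    return sorted(minority + rng.sample(majority, keep))


def select_attributes(docs: Sequence[Set[str]], targets: Sequence[int],
                      k: int) -> List[str]:
    """Keep the k words whose document frequency differs most between classes.

    The score is a simple stand-in (an assumption), used only to show that
    selection runs once per class and can yield a different word set each time.
    """
    n_pos = sum(targets) or 1
    n_neg = (len(targets) - sum(targets)) or 1
    vocab = set().union(*docs)

    def score(word: str) -> float:
        p = sum(1 for d, t in zip(docs, targets) if t == 1 and word in d) / n_pos
        q = sum(1 for d, t in zip(docs, targets) if t == 0 and word in d) / n_neg
        return abs(p - q)

    # Ties are broken alphabetically so the result is deterministic.
    return sorted(vocab, key=lambda w: (-score(w), w))[:k]


def prepare_per_class(docs: Sequence[Set[str]], label_sets: Sequence[Set[str]],
                      k: int) -> dict:
    """For each class, build its own balanced index set and attribute set."""
    classes = sorted(set().union(*label_sets))
    prepared = {}
    for cls in classes:
        targets = binary_targets(label_sets, cls)
        idx = undersample_majority(targets)
        sub_docs = [docs[i] for i in idx]
        sub_targets = [targets[i] for i in idx]
        prepared[cls] = (idx, select_attributes(sub_docs, sub_targets, k))
    return prepared
```

Any binary learner could then be trained on each class's reduced example set, restricted to that class's selected attributes. Whether selection should precede or follow undersampling is a design choice this sketch does not settle.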