Bias Analysis in Text Classification for Highly Skewed Data

Authors:
Lei Tang;Huan Liu
Affiliations:
Arizona State University;Arizona State University
Venue:
ICDM '05 Proceedings of the Fifth IEEE International Conference on Data Mining
Year:
2005

Citing 5

A Comparative Study on Feature Selection in Text Categorization
ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning

Feature Selection for Unbalanced Class Distribution and Naive Bayes
ICML '99 Proceedings of the Sixteenth International Conference on Machine Learning

An extensive empirical study of feature selection metrics for text classification
The Journal of Machine Learning Research

A study of the behavior of several methods for balancing machine learning training data
ACM SIGKDD Explorations Newsletter - Special issue on learning from imbalanced datasets

Learning when training data are costly: the effect of class distribution on tree induction
Journal of Artificial Intelligence Research

Cited 12

Blocking objectionable web content by leveraging multiple information sources
ACM SIGKDD Explorations Newsletter

Acclimatizing Taxonomic Semantics for Hierarchical Content Classification
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining

Topic taxonomy adaptation for group profiling
ACM Transactions on Knowledge Discovery from Data (TKDD)

Large scale multi-label classification via metalabeler
Proceedings of the 18th international conference on World wide web

Specializing for predicting obesity and its co-morbidities
Journal of Biomedical Informatics

Distinctive characteristics of a metric using deviations from Poisson for feature selection
Expert Systems with Applications: An International Journal

Sentence-level event classification in unstructured texts
Information Retrieval

VQSVM: A case study for incorporating prior domain knowledge into inductive machine learning
Neurocomputing

Comparison of metrics for feature selection in imbalanced text classification
Expert Systems with Applications: An International Journal

Group Profiling for Understanding Social Structures
ACM Transactions on Intelligent Systems and Technology (TIST)

Evaluation of the importance of data pre-processing order when combining feature selection and data sampling
International Journal of Business Intelligence and Data Mining

A comparative study on feature selection and adaptive strategies for email foldering using the ABC-DynF framework
Knowledge-Based Systems

Abstract

Feature selection is often applied to high-dimensional data as a preprocessing step in text classification. When dealing with highly skewed data, we observe that typical feature selection metrics such as information gain and chi-squared are biased toward selecting features for the minor class, whereas bi-normal separation can select features for both the minor and major classes. In this work, we use bias analysis to investigate how these feature selection metrics affect the performance of frequently used classifiers, such as Decision Trees, Naïve Bayes, and Support Vector Machines, on highly skewed data. We consider three types of bias: metric bias, class bias, and classifier bias. Extensive experiments are designed to understand how these biases can be employed in concert, and efficiently, to achieve good classification performance. We report our findings and present recommended approaches to text classification based on the bias analysis and the empirical study.
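
For readers unfamiliar with the three metrics named in the abstract, the following is a minimal sketch of how each can be computed for a single term from a two-class contingency table. It uses the standard textbook definitions, not code from the paper. The function names, the argument convention (tp and fp are the numbers of minor-class and major-class documents containing the term; fn and tn are those lacking it), and the clipping constant 0.0005 in bi-normal separation are assumptions made for this illustration.

```python
import math
from statistics import NormalDist


def _entropy(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_gain(tp, fp, fn, tn):
    """Reduction in class entropy from knowing whether the term is present."""
    n = tp + fp + fn + tn
    h_class = _entropy([(tp + fn) / n, (fp + tn) / n])
    h_cond = 0.0
    # Weighted class entropy within "term present" and "term absent" documents.
    for count, (a, b) in ((tp + fp, (tp, fp)), (fn + tn, (fn, tn))):
        if count:
            h_cond += count / n * _entropy([a / count, b / count])
    return h_class - h_cond


def chi_squared(tp, fp, fn, tn):
    """Chi-squared statistic of the 2x2 term/class contingency table."""
    n = tp + fp + fn + tn
    denom = (tp + fn) * (fp + tn) * (tp + fp) * (fn + tn)
    return n * (tp * tn - fp * fn) ** 2 / denom if denom else 0.0


def bi_normal_separation(tp, fp, fn, tn, eps=0.0005):
    """|F^-1(tpr) - F^-1(fpr)| with F^-1 the inverse standard normal CDF.

    Assumes both classes are non-empty. Rates are clipped to [eps, 1 - eps]
    (an assumed constant) so the inverse CDF stays finite.
    """
    def clip(rate):
        return min(max(rate, eps), 1 - eps)

    inv = NormalDist().inv_cdf
    tpr = clip(tp / (tp + fn))
    fpr = clip(fp / (fp + tn))
    return abs(inv(tpr) - inv(fpr))
```

Because bi-normal separation takes an absolute difference, a term that appears almost only in major-class documents scores as highly as one that appears almost only in minor-class documents, which is consistent with the abstract's remark that this metric can select features for both classes.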