Mutual information is a common feature score for feature selection in text categorization. It suffers from two theoretical problems: it assumes independent word variables, and it gives longer documents higher weights when estimating feature scores, in contrast to common evaluation measures, which do not distinguish between long and short documents. We propose a variant of mutual information, called Weighted Average Pointwise Mutual Information (WAPMI), that avoids both problems. We provide theoretical as well as extensive empirical evidence in favor of WAPMI. Furthermore, we show that WAPMI has a useful property that other feature metrics lack: it allows the best feature set size to be selected automatically by maximizing an objective function, which can be done with a simple heuristic rather than costly methods such as EM and model selection.
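The abstract does not give the WAPMI formula, so the following is a minimal sketch of one plausible instantiation: each term t is scored by averaging the pointwise mutual information log(P(t|c) / P(t)) over documents, weighted by the term's within-document probability P(t|d) and a per-document weight. The uniform document weight, the Laplace smoothing constant, and all function and variable names here are illustrative assumptions, not the paper's exact definitions.

import math
from collections import Counter

def wapmi_scores(docs, labels, alpha=0.01):
    """Sketch of a weighted average pointwise mutual information score.

    docs   -- list of token lists, one per document
    labels -- class label for each document
    alpha  -- Laplace smoothing constant (illustrative choice)

    Assumed scoring rule:
        WAPMI(t) = sum over documents d of w_d * P(t|d) * log(P(t|c_d) / P(t))
    with uniform document weights w_d = 1 / |D|; class-prior-based or
    length-based weightings are other conceivable choices.
    """
    vocab = {t for d in docs for t in d}
    classes = set(labels)
    n_docs = len(docs)

    # Per-class and global term counts.
    class_counts = {c: Counter() for c in classes}
    for d, c in zip(docs, labels):
        class_counts[c].update(d)
    global_counts = Counter()
    for c in classes:
        global_counts.update(class_counts[c])

    class_totals = {c: sum(class_counts[c].values()) for c in classes}
    global_total = sum(global_counts.values())
    V = len(vocab)

    def p_t_given_c(t, c):  # smoothed class-conditional term probability
        return (class_counts[c][t] + alpha) / (class_totals[c] + alpha * V)

    def p_t(t):             # smoothed global term probability
        return (global_counts[t] + alpha) / (global_total + alpha * V)

    scores = {t: 0.0 for t in vocab}
    for d, c in zip(docs, labels):
        if not d:
            continue
        w_d = 1.0 / n_docs  # uniform document weight (an assumption)
        for t, n in Counter(d).items():
            p_td = n / len(d)  # P(t|d): term's share of the document
            scores[t] += w_d * p_td * math.log(p_t_given_c(t, c) / p_t(t))
    return scores

Note how this construction addresses the length bias described above: the P(t|d) weights sum to 1 within each document, so every document contributes a total weight of w_d to the score regardless of its length. If the objective function mentioned in the abstract is a sum of such per-term scores (again an assumption here), then keeping exactly the terms with positive score would maximize it, which would match the claim that the best feature set size can be found with a simple heuristic rather than EM or model selection.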