Learning from little: comparison of classifiers given little training

  • Authors:
  • George Forman; Ira Cohen

  • Affiliations:
  • Hewlett-Packard Research Laboratories, 1501 Page Mill Rd., Palo Alto, CA (both authors)

  • Venue:
  • PKDD '04 Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases
  • Year:
  • 2004

Abstract

Many real-world machine learning tasks are faced with the problem of small training sets. Moreover, the class distribution of the training set often does not match the target distribution. In this paper we compare the performance of many learning models on a substantial benchmark of binary text classification tasks with small training sets. We vary both the training-set size and its class distribution to examine the learning surface, as opposed to the traditional learning curve. The models tested comprise various feature selection methods, each coupled with four learning algorithms: Support Vector Machines (SVM), Logistic Regression, Naive Bayes, and Multinomial Naive Bayes. Different models excel in different regions of the learning surface, yielding meta-knowledge about which model to apply in which situation. This can guide researchers and practitioners facing choices of model and feature selection method in, for example, information retrieval settings.
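
The experimental design described in the abstract can be illustrated with a short sketch. The following is a minimal Python sketch assuming scikit-learn; the data variables (`X_text`, `y`, `X_test`, `y_test`), the grid of training sizes and positive-class fractions, the chi-squared feature selector, and the F1 metric are illustrative assumptions, not the authors' actual protocol (the paper pairs several feature selection methods with each learner, and `BernoulliNB` merely stands in for plain Naive Bayes).

```python
# Minimal sketch (assumes scikit-learn; not the authors' code): sweep
# training-set size and class distribution to trace a "learning surface"
# for the four classifier families the paper compares.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

MODELS = {
    "SVM": LinearSVC,
    "LogisticRegression": lambda: LogisticRegression(max_iter=1000),
    "NaiveBayes": BernoulliNB,        # Bernoulli event model as a stand-in
    "MultinomialNB": MultinomialNB,
}

def sample_training_set(y, size, pos_fraction, rng):
    """Return indices of `size` examples with the requested positive fraction."""
    n_pos = int(round(size * pos_fraction))
    pos = rng.choice(np.flatnonzero(y == 1), n_pos, replace=False)
    neg = rng.choice(np.flatnonzero(y == 0), size - n_pos, replace=False)
    return np.concatenate([pos, neg])

def learning_surface(X_text, y, X_test, y_test,
                     sizes=(20, 50, 100, 200),
                     pos_fractions=(0.1, 0.3, 0.5),
                     n_features=100, seed=0):
    """Held-out F1 for every (model, training size, class fraction) cell."""
    rng = np.random.default_rng(seed)
    X_text, y = np.asarray(X_text, dtype=object), np.asarray(y)
    surface = {name: {} for name in MODELS}
    for size in sizes:
        for frac in pos_fractions:
            idx = sample_training_set(y, size, frac, rng)
            for name, make_clf in MODELS.items():
                # Note: k must not exceed the vocabulary induced by the sample.
                pipe = make_pipeline(TfidfVectorizer(),
                                     SelectKBest(chi2, k=n_features),
                                     make_clf())
                pipe.fit(X_text[idx], y[idx])
                surface[name][(size, frac)] = f1_score(y_test,
                                                       pipe.predict(X_test))
    return surface
```

Plotting each model's F1 over the (size, fraction) grid gives its learning surface; fixing the class fraction and varying only the training size recovers the traditional learning curve as a slice of that surface.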