This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality from repeated labeling, and focus especially on the improvement of training labels for supervised induction of predictive models. With the outsourcing of small tasks becoming easier, for example via Amazon's Mechanical Turk, it is often possible to obtain less-than-expert labeling at low cost. With low-cost labeling, preparing the unlabeled part of the data can become considerably more expensive than labeling it. We present repeated-labeling strategies of increasing complexity, and show several main results. (i) Repeated labeling can improve both label quality and model quality, but not always. (ii) When labels are noisy, repeated labeling can be preferable to single labeling even in the traditional setting where labels are not particularly cheap. (iii) As soon as processing the unlabeled data is not free, even the simple strategy of labeling everything multiple times can give a considerable advantage. (iv) Repeatedly labeling a carefully chosen set of points is generally preferable, and we present a set of robust techniques that combine different notions of uncertainty to select the data points whose quality should be improved. The bottom line: the results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data miners should have in their repertoire; for certain label-quality/cost regimes, the benefit is substantial.
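A minimal sketch of the core idea follows. It is not the authors' code; the function names, the uniform labeler accuracy p, the independence assumption, and the simple disagreement-based selection score are all illustrative assumptions. The sketch integrates repeated noisy labels by majority vote, computes the expected quality of the integrated label for independent labelers of equal accuracy, and scores examples by labeler disagreement so that the most uncertain ones can be sent out for another label.

from collections import Counter
from math import comb

def integrate_labels(labels):
    """Majority vote over a list of (possibly noisy) binary labels."""
    return Counter(labels).most_common(1)[0][0]

def majority_vote_quality(p, n):
    """Probability that a majority vote of n independent labelers, each
    correct with probability p, yields the correct label (n odd)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

def disagreement(labels):
    """Fraction of minority votes; a crude uncertainty score for deciding
    which examples should receive an additional label next."""
    counts = Counter(labels)
    return 1 - counts.most_common(1)[0][1] / sum(counts.values())

if __name__ == "__main__":
    votes = [1, 0, 1, 1, 0]
    print(integrate_labels(votes))                   # -> 1
    print(round(majority_vote_quality(0.7, 5), 3))   # -> 0.837
    print(round(disagreement(votes), 2))             # -> 0.4

Under these assumptions, five labels from labelers who are individually correct only 70% of the time yield an integrated label that is correct about 84% of the time, illustrating the kind of quality improvement the abstract describes; the disagreement score is just one example of the notions of uncertainty that selective repeated-labeling strategies can combine.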