Boosting a weak learning algorithm by majority
COLT '90 Proceedings of the third annual workshop on Computational learning theory
Improving Generalization with Active Learning
Machine Learning - Special issue on structured connectionist systems
CYC: a large-scale investment in knowledge infrastructure
Communications of the ACM
Advances in knowledge discovery and data mining
Large-Scale Simulation Studies in Image Pattern Recognition
IEEE Transactions on Pattern Analysis and Machine Intelligence
Machine Learning
On Bias, Variance, 0/1—Loss, and the Curse-of-Dimensionality
Data Mining and Knowledge Discovery
WETICE '01 Proceedings of the 10th IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises
Pattern Classification (2nd Edition)
Open data acquisition: theory and experiments
Empirical analysis of predictive algorithms for collaborative filtering
UAI'98 Proceedings of the Fourteenth conference on Uncertainty in artificial intelligence
IEEE Transactions on Information Theory - Part 2
Open Mind Common Sense: Knowledge Acquisition from the General Public
On the Move to Meaningful Internet Systems, 2002 - DOA/CoopIS/ODBASE 2002 Confederated International Conferences DOA, CoopIS and ODBASE 2002
The creation of a pattern classifier requires choosing or creating a model, collecting training data and verifying or "truthing" those data, and then training and testing the classifier. In practice, individual steps in this sequence must be repeated a number of times before the classifier achieves acceptable performance. Most research in computational learning theory addresses issues associated with training the classifier (learnability, convergence times, generalization bounds, and so on). While there has been modest research effort on topics such as cost-based collection of data in the context of a particular classifier model, numerous unsolved problems of practical importance remain in the collection and truthing of data. Many of these can be addressed with the formal methods of computational learning theory. A number of these issues, as well as new ones, such as the identification of "hostile" contributors and their data, are brought to light by the Open Mind Initiative, where data is openly contributed over the World Wide Web by non-experts of varying reliabilities. This paper extends formal results on the relative value of labeled and unlabeled data to the realistic case where the labeler is not a foolproof oracle but is instead somewhat unreliable and error-prone. It also summarizes formal results on strategies for presenting data to labelers of known reliability in order to obtain the best estimates of model parameters. It concludes with a call for a rich, powerful and practical computational theory of data acquisition and truthing, built upon the concepts and techniques developed for studying general learning systems.
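To make the unreliable-labeler setting concrete, the following is a minimal sketch (not taken from the paper itself) of a standard bias correction: if a labeler of known reliability flips each true binary label with probability eps, the observed positive rate is p_obs = p(1 - eps) + (1 - p)eps, which can be inverted to recover an unbiased estimate of the true rate p. The simulation parameters (true rate 0.30, error rate 0.20) are illustrative assumptions.

```python
import random

def corrected_estimate(labels, eps):
    """Debias the empirical positive rate from a labeler that flips
    each true label with known probability eps (eps < 0.5).
    Observed rate: p_obs = p*(1 - eps) + (1 - p)*eps,
    so the true rate is p = (p_obs - eps) / (1 - 2*eps)."""
    p_obs = sum(labels) / len(labels)
    return (p_obs - eps) / (1 - 2 * eps)

# Illustrative simulation: true positive rate 0.30, labeler error 0.20.
random.seed(0)
p_true, eps = 0.30, 0.20
true_labels = [1 if random.random() < p_true else 0 for _ in range(100_000)]
noisy = [1 - y if random.random() < eps else y for y in true_labels]

naive = sum(noisy) / len(noisy)           # biased toward 0.5 (about 0.38 here)
debiased = corrected_estimate(noisy, eps)  # close to the true rate 0.30
```

The same inversion underlies why labeled data from an error-prone contributor is still informative, provided the contributor's reliability is known: the noise shrinks the signal but does not destroy it, at a cost that grows as eps approaches 1/2.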