Open data acquisition: theory and experiments

  • Authors:
  • David G. Stork; Chuck P. Lam

  • Affiliations:
  • Stanford University; Stanford University

  • Venue:
  • Ph.D. dissertation, Stanford University
  • Year:
  • 2005

Abstract

Decades of research have pointed to the need for large datasets in building intelligent systems. While recent improvements in storage and other technologies have led to an explosion of available data, these raw data are often inadequate by themselves for machine learning, and additional human-provided information is necessary. In natural language processing, optical character recognition, and speech recognition, for example, raw data can be collected easily but must be manually labeled before pattern classifiers can be trained. Unfortunately, the acquisition of such manual labels has not scaled at the exponential rate of information technologies, and it has become the economic bottleneck in building many pattern recognition systems. Open data acquisition attempts to scale up the collection of such labels by openly recruiting volunteers through the Web.

This dissertation addresses the challenges of open data acquisition from such non-expert contributors. We first address the oft-overlooked issue of evaluating classifiers with noisy labels, and we derive tight upper and lower bounds on the true error rate under mild assumptions. In analyzing both the training and the evaluation of classifiers with noisy labels, we find that in many problem domains the larger volume of data can compensate for the increase in labeling noise. Specifically, challenging problem domains with naturally medium to high error rates benefit the most from open data acquisition.

Open data acquisition can be made even more effective by optimizing its labeling strategy in the same spirit as active learning. We find that the information-theoretically optimal labeling strategy is to leave some data samples unlabeled while having others labeled by multiple contributors. The optimal strategy also tends to assign difficult samples to more skilled labelers and easier samples to less skilled ones. By leveraging the power of ordinary people through the Web, open data acquisition is a promising approach to alleviating the data bottleneck in building intelligent machines.
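To make the abstract's claims about noisy-label evaluation and repeated labeling concrete, the sketch below works through a simplified model rather than the dissertation's actual derivation: it assumes symmetric, independent label noise with a flip rate known only to lie in an interval, and uses the standard identity relating the error rate measured against noisy labels to the classifier's true error rate. The function names (`observed_error`, `true_error_bounds`, `majority_vote_flip_rate`) and the numbers in the example are hypothetical illustrations, not results from the thesis.

```python
from math import comb


def observed_error(true_error: float, flip_rate: float) -> float:
    """Error rate measured against labels corrupted by symmetric noise.

    A sample counts as an error when the classifier disagrees with the
    noisy label: either the classifier is wrong and the label is clean,
    or the classifier is right and the label was flipped.
    """
    return true_error * (1.0 - flip_rate) + (1.0 - true_error) * flip_rate


def true_error_bounds(obs_error: float, flip_lo: float, flip_hi: float):
    """Invert the identity above when the flip rate is only known to lie
    in [flip_lo, flip_hi] (both below 0.5), giving an interval that must
    contain the classifier's true error rate."""
    def invert(rho: float) -> float:
        return (obs_error - rho) / (1.0 - 2.0 * rho)

    candidates = [invert(flip_lo), invert(flip_hi)]
    return max(0.0, min(candidates)), min(1.0, max(candidates))


def majority_vote_flip_rate(flip_rate: float, k: int) -> float:
    """Effective noise after a majority vote over k independent labels
    per sample (k odd), each with the same symmetric flip rate.  This is
    the trade-off behind labeling some samples multiple times: fewer
    samples get labeled, but each label set is more reliable."""
    return sum(
        comb(k, j) * flip_rate**j * (1.0 - flip_rate) ** (k - j)
        for j in range(k // 2 + 1, k + 1)
    )


if __name__ == "__main__":
    # A classifier scoring 25% error against labels whose noise level is
    # only known to lie between 5% and 15%.
    lo, hi = true_error_bounds(0.25, 0.05, 0.15)
    print(f"true error rate lies in [{lo:.3f}, {hi:.3f}]")

    # Spending three labels per sample drives a 15% flip rate down to ~6%.
    print(f"3-vote effective noise: {majority_vote_flip_rate(0.15, 3):.3f}")
```

Under these assumptions the interval on the true error narrows as the flip-rate uncertainty shrinks, while collecting more (noisier) evaluation data tightens the statistical estimate of the observed error itself; this is the sense in which volume can compensate for labeling noise.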