ALFRED: crowd assisted data extraction

  • Authors:
  • Valter Crescenzi, Paolo Merialdo, Disheng Qiu

  • Affiliation:
  • Università degli Studi Roma Tre, Rome, Italy (all authors)

  • Venue:
  • Proceedings of the 22nd International Conference on World Wide Web Companion
  • Year:
  • 2013

Abstract

Scaling the extraction of data from Web sources remains a challenging problem. Supervised approaches can achieve high accuracy, but the cost of training data, i.e., annotations over a set of sample pages, limits their scalability. Crowdsourcing platforms are making the manual annotation process more affordable. However, the tasks submitted to these platforms should be extremely simple, so that non-expert workers can perform them, and their number should be minimized to contain costs. We demonstrate ALFRED, a wrapper inference system supervised by the workers of a crowdsourcing platform. Its training data are labeled values generated by means of membership queries, the simplest form of query, posed to the crowd. ALFRED includes several original features: it automatically selects a representative sample set from the input collection of pages; to minimize wrapper inference costs, it dynamically sets the expressiveness of the wrapper formalism and adopts an active learning algorithm to select the queries posed to the crowd; and it can handle the inaccurate answers that workers engaged through crowdsourcing platforms may provide.
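
To make the idea of crowd-supervised wrapper inference more concrete, the following is a minimal Python sketch of a membership-query loop. It is not ALFRED's actual algorithm, which the abstract does not spell out. It is a toy illustration under stated assumptions: a small, fixed set of candidate extraction rules (`CANDIDATE_RULES`, hypothetical), a greedy query selector that asks about the value splitting the remaining rules most evenly, and a simple majority vote to absorb occasional wrong answers. All names, the sample data, and the simulated crowd are invented for the example.

```python
from collections import Counter

# Hypothetical toy data: each candidate rule maps a page id to the value it
# would extract from that page. A real system would generate these rules.
CANDIDATE_RULES = {
    "rule_a": {"p1": "Inception", "p2": "Heat"},
    "rule_b": {"p1": "Inception", "p2": "1995"},
    "rule_c": {"p1": "2010", "p2": "1995"},
}


def pick_query(rules):
    """Pick the (page, value) pair whose yes/no answer splits the rules most evenly.

    Returns None when the remaining rules agree on every page.
    """
    counts = Counter()
    for extracted in rules.values():
        for page, value in extracted.items():
            counts[(page, value)] += 1
    n = len(rules)
    # A pair extracted by every remaining rule cannot discriminate between them.
    informative = {q: c for q, c in counts.items() if c < n}
    if not informative:
        return None
    # Sorting first makes tie-breaking deterministic.
    return min(sorted(informative), key=lambda q: abs(2 * informative[q] - n))


def majority_answer(votes):
    """Aggregate boolean worker answers by strict majority (an assumed policy)."""
    return sum(votes) * 2 > len(votes)


def prune(rules, query, answer):
    """Keep only the rules consistent with the aggregated answer."""
    page, value = query
    return {
        name: extracted
        for name, extracted in rules.items()
        if (extracted.get(page) == value) == answer
    }


def infer(rules, ask_crowd, workers_per_query=3):
    """Pose membership queries until at most one candidate rule remains."""
    rules = dict(rules)
    while len(rules) > 1:
        query = pick_query(rules)
        if query is None:
            break
        votes = ask_crowd(query, workers_per_query)
        rules = prune(rules, query, majority_answer(votes))
    return sorted(rules)


def simulated_crowd(query, k):
    """Simulated workers: the truth follows rule_a, and one of k answers is wrong."""
    page, value = query
    correct = CANDIDATE_RULES["rule_a"][page] == value
    return [correct] * (k - 1) + [not correct]


if __name__ == "__main__":
    print(infer(CANDIDATE_RULES, simulated_crowd))
```

In this sketch each query is a single yes/no question ("is this value correct for this page?"), which mirrors the abstract's point that tasks should be simple enough for non-expert workers. Asking several workers per query and taking the majority is only one possible way to tolerate inaccurate answers, and it trades extra cost for robustness.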