ALFRED: crowd assisted data extraction

Authors:
Valter Crescenzi;Paolo Merialdo;Disheng Qiu
Affiliations:
Università degli Studi Roma Tre, Rome, Italy;Università degli Studi Roma Tre, Rome, Italy;Università degli Studi Roma Tre, Rome, Italy
Venue:
Proceedings of the 22nd international conference on World Wide Web companion
Year:
2013

Abstract

Scaling the extraction of data from Web sources remains a challenging problem. Supervised approaches can achieve high accuracy, but the cost of training data, i.e., annotations over a set of sample pages, limits their scalability. Crowdsourcing platforms are making manual annotation more affordable. However, the tasks submitted to these platforms must be extremely simple, so that non-expert workers can perform them, and their number must be kept small to contain costs. We demonstrate ALFRED, a wrapper inference system supervised by the workers of a crowdsourcing platform. The training data are labeled values obtained by posing membership queries, the simplest form of queries, to the crowd. ALFRED includes several original features. It automatically selects a representative sample set from the input collection of pages. To minimize wrapper inference costs, it dynamically sets the expressiveness of the wrapper formalism and adopts an active learning algorithm to select the queries posed to the crowd. Finally, it can handle the inaccurate answers that workers engaged through crowdsourcing platforms may provide.
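To make the interplay of membership queries, active learning, and noisy answers concrete, the following is a minimal sketch and not ALFRED's actual algorithm, which the abstract does not specify. It assumes a small, fixed set of hypothetical candidate extraction rules (`r1`, `r2`, `r3`), a fixed worker error rate, and a simple Bayesian update. Each membership query asks whether a given value on a given page is correct, and the next query is the one that splits the remaining probability mass most evenly.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy model: a candidate rule maps a page id to the value it extracts.
Rule = Dict[str, str]
# A membership query asks: "is this value the correct one on this page?"
Query = Tuple[str, str]


def agrees(rule: Rule, query: Query) -> bool:
    """A rule answers 'yes' to a query iff it extracts that value from that page."""
    page, value = query
    return rule.get(page) == value


def pick_query(rules: Dict[str, Rule], probs: Dict[str, float],
               queries: List[Query]) -> Query:
    """Active learning step: choose the query whose 'yes' mass is closest to 1/2."""
    def imbalance(q: Query) -> float:
        yes = sum(p for name, p in probs.items() if agrees(rules[name], q))
        return abs(yes - 0.5)
    return min(queries, key=imbalance)


def update(rules: Dict[str, Rule], probs: Dict[str, float], query: Query,
           answer: bool, error_rate: float) -> Dict[str, float]:
    """Bayesian update assuming each answer is wrong with probability error_rate."""
    posterior = {}
    for name, p in probs.items():
        consistent = agrees(rules[name], query) == answer
        posterior[name] = p * ((1 - error_rate) if consistent else error_rate)
    total = sum(posterior.values())
    return {name: p / total for name, p in posterior.items()}


def infer_wrapper(rules: Dict[str, Rule], oracle: Callable[[Query], bool],
                  error_rate: float = 0.1, threshold: float = 0.95,
                  max_queries: int = 20) -> Tuple[str, float, int]:
    """Query the crowd (oracle) until one rule is likely enough or the budget ends."""
    probs = {name: 1 / len(rules) for name in rules}
    queries = sorted({(page, value)
                      for rule in rules.values()
                      for page, value in rule.items()})
    asked = 0
    while max(probs.values()) < threshold and asked < max_queries:
        query = pick_query(rules, probs, queries)
        probs = update(rules, probs, query, oracle(query), error_rate)
        asked += 1
    best = max(probs, key=probs.get)
    return best, probs[best], asked


# Illustrative data only: three rules that disagree on two sample pages.
rules = {
    "r1": {"p1": "A", "p2": "B"},
    "r2": {"p1": "A", "p2": "C"},
    "r3": {"p1": "D", "p2": "C"},
}
# Simulated worker who always answers according to the (unknown) correct rule r2.
best, confidence, asked = infer_wrapper(rules, lambda q: agrees(rules["r2"], q))
print(best, round(confidence, 3), asked)
```

Because a single answer is never treated as certain, a rule that contradicts one answer is down-weighted rather than discarded, so the same query may be posed more than once. In this sketch that is how noisy workers are tolerated, at the price of a few extra queries.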