A framework for learning web wrappers from the crowd

Authors:
Valter Crescenzi;Paolo Merialdo;Disheng Qiu
Affiliations:
Università Roma Tre, Rome, Italy;Università Roma Tre, Rome, Italy;Università Roma Tre, Rome, Italy
Venue:
Proceedings of the 22nd international conference on World Wide Web
Year:
2013



Abstract

Scaling the extraction of data from Web sources remains a challenging problem. Supervised approaches can achieve high accuracy, but the cost of training data, i.e., annotations over a set of sample pages, limits their scalability. Crowdsourcing platforms are making the manual annotation process more affordable. However, the tasks submitted to these platforms must be simple enough to be performed by non-experts, and their number must be kept small to contain costs. We introduce a framework to support a supervised wrapper inference system with training data generated by the crowd. The training data are labeled values obtained by posing membership queries, the simplest form of query, to the crowd. We show that the cost of producing the training data is strongly affected by the expressiveness of the wrapper formalism and by the choice of the training set. Traditional supervised wrapper inference approaches use a statically defined formalism and assume that it is able to express the wrapper. In contrast, we present an inference algorithm that dynamically chooses the expressiveness of the wrapper formalism and actively selects the training set, while minimizing the number of membership queries posed to the crowd. We report the results of experiments on real Web sources that confirm the effectiveness and feasibility of the approach.
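
To make the idea in the abstract concrete, the following is a minimal sketch, not the authors' algorithm. It assumes that a wrapper rule can be represented simply by the set of values it extracts from the sample pages, that the rule classes are given in order of increasing expressiveness, and that the crowd is a function answering yes/no membership queries. All names (`learn_wrapper`, `pick_query`, the rule names and the values) are hypothetical. The sketch asks first about the value on which the surviving rules are most evenly split, and moves to a more expressive class only when every rule of the current class has been contradicted. Unlike the approach described in the abstract, it confirms every value the surviving rules extract and makes no further attempt to minimize the number of queries.

```python
from typing import Callable, Dict, FrozenSet, List, Optional, Tuple

# Assumption: a rule is a name plus the set of values it extracts from the sample.
Rule = Tuple[str, FrozenSet[str]]


def consistent(rule: Rule, labels: Dict[str, bool]) -> bool:
    """A rule survives if it extracts every 'yes' value and no 'no' value."""
    _, extracted = rule
    return all((value in extracted) == answer for value, answer in labels.items())


def pick_query(rules: List[Rule], labels: Dict[str, bool]) -> Optional[str]:
    """Pick the unlabeled value on which the surviving rules are most evenly split."""
    candidates = set().union(*(extracted for _, extracted in rules)) - labels.keys()
    if not candidates:
        return None

    def gap(value: str) -> int:
        votes = sum(value in extracted for _, extracted in rules)
        return abs(2 * votes - len(rules))  # 0 means a perfect split

    return min(sorted(candidates), key=gap)


def learn_wrapper(
    classes: List[List[Rule]],
    crowd: Callable[[str], bool],
    seed: str,
) -> Tuple[Optional[Rule], Optional[int], int]:
    """Return (rule, expressiveness level, number of membership queries asked)."""
    labels: Dict[str, bool] = {seed: True}  # one initial correct value
    queries = 0
    for level, rules in enumerate(classes):
        alive = [r for r in rules if consistent(r, labels)]
        while alive:
            value = pick_query(alive, labels)
            if value is None:  # every extracted value has been confirmed
                return alive[0], level, queries
            labels[value] = crowd(value)  # membership query to the crowd
            queries += 1
            alive = [r for r in alive if consistent(r, labels)]
        # No rule of this class fits the answers: try a more expressive class.
    return None, None, queries


# Toy sample: titles t1, t2 are the target; p1, p2 are prices; a1 is an ad.
classes = [
    [("first-bold", frozenset({"t1", "p2"})),
     ("any-span", frozenset({"t1", "t2", "p1", "p2", "a1"}))],
    [("h1-in-main", frozenset({"t1", "t2"})),
     ("h1-any", frozenset({"t1", "t2", "a1"}))],
]
rule, level, queries = learn_wrapper(classes, lambda v: v in {"t1", "t2"}, seed="t1")
print(rule[0], level, queries)
```

In this toy run the two simple rules are both contradicted by the crowd's answers, so the learner falls back to the second, more expressive class and settles on the rule that extracts exactly the two titles.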