Scaling the extraction of data from Web sources remains a challenging problem. Supervised approaches can achieve high accuracy, but the cost of training data, i.e., annotations over a set of sample pages, limits their scalability. Crowdsourcing platforms are making manual annotation more affordable. However, the tasks submitted to these platforms should be extremely simple, so that non-experts can perform them, and their number should be minimized to contain costs. We demonstrate ALFRED, a wrapper inference system supervised by the workers of a crowdsourcing platform. Its training data are labeled values generated by means of membership queries, the simplest form of query, posed to the crowd. ALFRED includes several original features. It automatically selects a representative sample set from the input collection of pages. To minimize wrapper inference costs, it dynamically sets the expressiveness of the wrapper formalism and adopts an active learning algorithm to select the queries posed to the crowd. It can also handle the inaccurate answers that workers engaged through crowdsourcing platforms may provide.
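The interplay of membership queries, active query selection, and inaccurate answers can be illustrated with a minimal Python sketch. It is not ALFRED's algorithm: the page model (a dictionary of slots), the candidate rules, the greedy "most even disagreement" selection, the majority vote over workers, and all names such as `pick_query`, `infer_wrapper`, and `simulated_crowd` are assumptions introduced here for illustration only.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

Page = Dict[str, str]          # hypothetical page model: slot name -> text
Rule = Callable[[Page], str]   # candidate wrapper: page -> extracted value
Query = Tuple[int, str]        # (page index, value): "is this value correct here?"
Crowd = Callable[[int, str, int], List[bool]]


def majority_vote(answers: List[bool]) -> bool:
    """Aggregate possibly inaccurate worker answers by simple majority."""
    counts = Counter(answers)
    return counts[True] > counts[False]


def pick_query(rules: Dict[str, Rule], pages: List[Page]) -> Optional[Query]:
    """Pick the membership query on which the surviving rules split most evenly."""
    best, best_score = None, 0
    for i, page in enumerate(pages):
        votes = Counter(rule(page) for rule in rules.values())
        for value, n in votes.items():
            # Number of rules discarded in the worst case for this query.
            score = min(n, len(rules) - n)
            if score > best_score:
                best, best_score = (i, value), score
    return best  # None when all rules agree on every sample page


def infer_wrapper(rules: Dict[str, Rule], pages: List[Page],
                  ask_crowd: Crowd, workers_per_query: int = 3) -> List[str]:
    """Discard rules inconsistent with crowd answers until no query can split them."""
    rules = dict(rules)
    while len(rules) > 1:
        query = pick_query(rules, pages)
        if query is None:
            break
        i, value = query
        is_correct = majority_vote(ask_crowd(i, value, workers_per_query))
        rules = {name: rule for name, rule in rules.items()
                 if (rule(pages[i]) == value) == is_correct}
    return sorted(rules)


# Toy data (assumed): the target attribute is the book title, held in slot "h1".
pages = [
    {"h1": "Dune", "b": "Dune", "td": "Frank Herbert"},
    {"h1": "Emma", "b": "Jane Austen", "td": "Jane Austen"},
]
rules = {
    "h1": lambda p: p["h1"],
    "b": lambda p: p["b"],
    "td": lambda p: p["td"],
}


def simulated_crowd(page_index: int, value: str, n: int) -> List[bool]:
    """Simulated workers: all answer truthfully except one, who answers wrongly."""
    truth = pages[page_index]["h1"] == value
    return [truth] * (n - 1) + [not truth]


print(infer_wrapper(rules, pages, simulated_crowd))
```

In this toy setting, each query removes at least one candidate rule, so the loop terminates after at most one query per rule. Simple majority voting is only one possible way to absorb worker errors; a real system would likely weigh worker reliability and query cost as well.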