Developing solutions that scale the extraction of data from Web sources remains a challenging problem. Supervised approaches can achieve high accuracy, but the cost of producing training data, i.e., annotations over a set of sample pages, limits their scalability. Crowdsourcing platforms are making the manual annotation process more affordable. However, the tasks submitted to these platforms must be extremely simple, so that non-expert workers can perform them, and their number must be minimized to keep costs under control. We introduce a framework that supports a supervised wrapper inference system with training data generated by the crowd. The training data are labeled values obtained through membership queries, the simplest form of query, posed to the crowd. We show that the cost of producing the training data strongly depends on the expressiveness of the wrapper formalism and on the choice of the training set. Traditional supervised wrapper inference approaches use a statically defined formalism, assuming it is expressive enough to capture the wrapper. In contrast, we present an inference algorithm that dynamically chooses the expressiveness of the wrapper formalism and actively selects the training set, while minimizing the number of membership queries posed to the crowd. We report experimental results on real web sources that confirm the effectiveness and feasibility of the approach.
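To make the query-minimization idea concrete, the following is a minimal sketch, not the paper's algorithm: a set of candidate extraction rules plays the role of the wrapper hypothesis space, an oracle function stands in for a crowd worker answering membership queries ("is this value correct?"), and each query is chosen to split the surviving candidates as evenly as possible so that few queries are needed. All names (`infer_wrapper`, the toy rules and pages) are hypothetical.

```python
# Illustrative sketch of membership-query-driven wrapper inference.
# A "wrapper" is one of several candidate extraction rules; a membership
# query asks an oracle (a stand-in for a crowd worker) whether a value
# extracted from a page is a correct instance of the target attribute.
# Rules inconsistent with the answer are discarded; queries are picked
# to split the surviving candidates as evenly as possible.

def infer_wrapper(candidate_rules, pages, oracle):
    """Return (surviving rules, number of queries issued).

    candidate_rules: dict name -> function(page) -> extracted value
    pages:           sample pages to extract from
    oracle:          function(value) -> bool (membership query)
    """
    alive = dict(candidate_rules)
    queries = 0
    while len(alive) > 1:
        # Choose the (page, value) whose yes/no answer best bisects
        # the remaining candidate rules.
        best = None
        for page in pages:
            values = {name: rule(page) for name, rule in alive.items()}
            for v in set(values.values()):
                agree = sum(1 for x in values.values() if x == v)
                imbalance = abs(agree - (len(alive) - agree))
                if best is None or imbalance < best[0]:
                    best = (imbalance, page, v)
        if best is None or best[0] == len(alive):
            break  # no query can distinguish the remaining rules
        _, page, v = best
        answer = oracle(v)  # one membership query to the "crowd"
        queries += 1
        # Keep rules whose behavior on this page matches the answer.
        alive = {n: r for n, r in alive.items()
                 if (r(page) == v) == answer}
    return alive, queries


# Toy usage: pages as dicts, two candidate rules for a hypothetical
# "title" attribute; the oracle accepts values from a known truth set.
pages = [{"title": "IEPAD", "venue": "WWW"},
         {"title": "Lixto", "venue": "PODS"}]
rules = {
    "take_title": lambda p: p["title"],
    "take_venue": lambda p: p["venue"],
}
truth = {"IEPAD", "Lixto"}  # values a crowd worker would accept
alive, n_queries = infer_wrapper(rules, pages, lambda v: v in truth)
```

In this toy run a single membership query suffices to eliminate the wrong rule, which mirrors the abstract's point: a well-chosen query over a well-chosen sample sharply reduces annotation cost. The real system additionally adapts the expressiveness of the rule language itself, which this sketch does not model.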