We consider the problem of extracting structured records from semi-structured web pages with no human supervision required for each target web site. Previous work on this problem has either required significant human effort for each target site or used brittle heuristics to identify semantic data types. Our method only requires annotation for a few pages from a few sites in the target domain. Thus, after a tiny investment of human effort, our method allows automatic extraction from potentially thousands of other sites within the same domain. Our approach extends previous methods for detecting data fields in semi-structured web pages by matching those fields to domain schema columns using robust models of data values and contexts. Annotating 2-5 pages for 4-6 web sites yields an extraction accuracy of 83.8% on job offer sites and 91.1% on vacation rental sites. These results significantly outperform a baseline approach.