This paper introduces a new algorithm that learns to extract data from Web pages with relatively regular data structures. Existing systems require training on either manually labelled pages or at least two similar unlabelled pages, and they often have difficulty handling Web pages with complex formats such as nested tables or lists. Our previous system, AutoWrapper, does not need any training and can automatically extract data from any single page. This paper improves AutoWrapper by handling nested structures and finding multiple regular data areas. The main contributions are a tree-based representation for Web pages, an expressive language for representing information extraction patterns, and a learning algorithm that automatically detects regular data areas by finding similar sub-trees.
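
The idea of detecting regular data areas by comparing sub-trees can be illustrated with a minimal Python sketch. This is not the AutoWrapper algorithm, whose details the abstract does not give. It assumes a simplified setting in which a page is parsed into a tag tree and a "regular data area" is a run of adjacent sibling sub-trees with exactly the same tag structure. The names `Node`, `TreeBuilder`, `parse`, and `regular_areas` are hypothetical, and exact structural equality stands in for whatever similarity measure the paper actually uses.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, enough for the sketch).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """One element of the tag tree; text content is ignored."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []

    def signature(self):
        # Structural signature: the tag plus the signatures of all children.
        return (self.tag, tuple(c.signature() for c in self.children))


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML using the standard-library parser."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def regular_areas(node, min_repeats=2):
    """Return (parent_tag, child_tag, count) for each run of adjacent,
    structurally identical children, searching the whole tree so that
    nested and multiple areas are both reported."""
    areas = []
    kids = node.children
    run_start = 0
    for i in range(1, len(kids) + 1):
        run_ended = (
            i == len(kids)
            or kids[i].signature() != kids[run_start].signature()
        )
        if run_ended:
            if i - run_start >= min_repeats:
                areas.append((node.tag, kids[run_start].tag, i - run_start))
            run_start = i
    for child in kids:
        areas.extend(regular_areas(child, min_repeats))
    return areas


page = (
    "<ul><li><b>A</b><i>1</i></li><li><b>B</b><i>2</i></li>"
    "<li><b>C</b><i>3</i></li></ul>"
    "<table><tr><td>x</td></tr><tr><td>y</td></tr></table>"
)
print(regular_areas(parse(page)))
```

On this sample page the sketch reports two areas, the three `li` items and the two `tr` rows, using only the single page as input. A practical system would need an approximate similarity measure, since real records rarely share an identical tag structure.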