Post-supervised template induction for dynamic web sources

Authors:
Zhongmin Shi;Evangelos Milios;Nur Zincir-Heywood
Affiliations:
Faculty of Computer Science, Dalhousie University, Halifax, N.S., Canada;Faculty of Computer Science, Dalhousie University, Halifax, N.S., Canada;Faculty of Computer Science, Dalhousie University, Halifax, N.S., Canada
Venue:
AI'03: Proceedings of the 16th Canadian Society for Computational Studies of Intelligence Conference on Advances in Artificial Intelligence
Year:
2003

Abstract

Dynamic web sites commonly return information in the form of lists and tables. Hand-crafting an extraction program for a specific template is straightforward but time-consuming, so it is desirable to generate template extraction programs automatically from examples of lists and tables in HTML documents. We describe a novel technique, Post-supervised Learning, which exploits unsupervised learning to avoid the need for training examples, while minimally involving the user to achieve high accuracy. We have developed unsupervised algorithms to determine the number of rows and adopted a dynamic programming algorithm for extracting columns. Our system, called TIDE (Template Induction for web Data Extraction), achieves high performance with minimal user input relative to fully supervised techniques.