Post-supervised template induction for dynamic web sources

  • Authors:
  • Zhongmin Shi;Evangelos Milios;Nur Zincir-Heywood

  • Affiliations:
  • Faculty of Computer Science, Dalhousie University, Halifax, N.S., Canada;Faculty of Computer Science, Dalhousie University, Halifax, N.S., Canada;Faculty of Computer Science, Dalhousie University, Halifax, N.S., Canada

  • Venue:
  • AI'03: Proceedings of the 16th Canadian Society for Computational Studies of Intelligence Conference on Advances in Artificial Intelligence
  • Year:
  • 2003

Abstract

Dynamic web sites commonly return information in the form of lists and tables. Hand-crafting an extraction program for a specific template is straightforward but time-consuming, so it is desirable to generate template extraction programs automatically from examples of lists and tables in HTML documents. We describe a novel technique, Post-supervised Learning, which exploits unsupervised learning to avoid the need for training examples, while minimally involving the user to achieve high accuracy. We have developed unsupervised algorithms to determine the number of rows and adopted a dynamic programming algorithm for extracting columns. Our system, called TIDE (Template Induction for web Data Extraction), achieves high performance while requiring minimal user input compared with fully supervised techniques.
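
The abstract does not say which dynamic programming algorithm TIDE uses for column extraction. As a rough illustration of the general idea only, the sketch below assumes a longest-common-subsequence (LCS) alignment: tokens shared by two example rows are treated as the template, and whatever falls between them is treated as column data. The function names (tokenize, lcs_template, extract_columns) and the sample rows are hypothetical and are not taken from the paper.

```python
import re


def tokenize(row_html):
    """Split an HTML fragment into tag tokens and text tokens."""
    tokens = re.findall(r"<[^>]+>|[^<]+", row_html)
    return [t.strip() for t in tokens if t.strip()]


def lcs_template(a, b):
    """Longest common subsequence of two token lists (classic DP).

    Assumption for this sketch: tokens common to both rows are template
    markup. A data value that happens to repeat in both rows would be
    misread as template, which a real system would have to handle.
    """
    n, m = len(a), len(b)
    # dp[i][j] = length of the LCS of a[i:] and b[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table from the start to recover one LCS.
    template, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            template.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return template


def extract_columns(row_tokens, template):
    """Collect the runs of non-template tokens in a row as column values."""
    columns, current, t = [], [], 0
    for tok in row_tokens:
        if t < len(template) and tok == template[t]:
            if current:
                columns.append(" ".join(current))
                current = []
            t += 1
        else:
            current.append(tok)
    if current:
        columns.append(" ".join(current))
    return columns


# Two hypothetical example rows induce the template; a third row is parsed.
row1 = tokenize("<tr><td>Alice</td><td>30</td></tr>")
row2 = tokenize("<tr><td>Bob</td><td>25</td></tr>")
template = lcs_template(row1, row2)
row3 = tokenize("<tr><td>Carol</td><td>41</td></tr>")
print(extract_columns(row3, template))
```

Only two example rows are needed to induce the template here, which loosely mirrors the abstract's goal of avoiding labeled training data. The row-count estimation and the post-supervision step in which the user confirms the result are not modeled.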