TreeWrapper: automatic data extraction based on tree representation

  • Authors: Xiaoying Gao; Mengjie Zhang; Minh Duc Cao

  • Affiliations (all authors): School of Mathematics, Statistics and Computer Science, Victoria University of Wellington, Wellington, New Zealand

  • Venue: AI'06: Proceedings of the 19th Australian Joint Conference on Artificial Intelligence: Advances in Artificial Intelligence
  • Year: 2006

Abstract

This paper introduces a new algorithm that learns to extract data from Web pages with relatively regular data structures. Existing systems require training on either manually labelled pages or at least two similar unlabelled pages, and they often have difficulty handling Web pages with complex formats such as nested tables or lists. Our previous system, AutoWrapper, does not need any training and can automatically extract data from any single page. This paper improves on AutoWrapper by handling nested structures and finding multiple regular data areas. The main contributions are a tree-based representation for Web pages, an expressive language for representing information extraction patterns, and a learning algorithm that automatically detects regular data areas by finding similar sub-trees.
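
To make the idea of detecting regular data areas through similar sub-trees concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the similarity measure or the pattern language, so this sketch assumes that two sub-trees are "similar" only when their tag structures are exactly equal, and that a regular data area is a run of adjacent sibling sub-trees sharing one structure. The names `Node`, `TreeBuilder`, `signature`, `regular_areas` and `extract` are hypothetical and chosen for illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag (assumed subset, for illustration).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """One element of the page tree: a tag, its text, and its children."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a simple tag tree from HTML using the standard library parser."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag; ignore stray closes.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def signature(node):
    """Structural fingerprint of a sub-tree: its tag plus its children's fingerprints."""
    return (node.tag, tuple(signature(c) for c in node.children))


def regular_areas(node, min_repeats=2):
    """Return runs of adjacent siblings with identical structure, at every depth."""
    areas = []
    kids = node.children
    i = 0
    while i < len(kids):
        sig = signature(kids[i])
        j = i + 1
        while j < len(kids) and signature(kids[j]) == sig:
            j += 1
        if j - i >= min_repeats:
            areas.append(kids[i:j])
        i = j
    # Recurse so that nested structures and multiple areas are both found.
    for child in kids:
        areas.extend(regular_areas(child, min_repeats))
    return areas


def texts(node):
    """Collect the text fields of a sub-tree in document order."""
    out = [node.text] if node.text else []
    for child in node.children:
        out.extend(texts(child))
    return out


def extract(html):
    """Parse one page and return, per regular area, the text of each record."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return [[texts(rec) for rec in area] for area in regular_areas(builder.root)]
```

Calling `extract` on a single page that contains both a list and a table returns one group of records per detected area, with no training pages involved. Exact structural equality is the simplest possible stand-in for similarity; a more tolerant measure would be needed for records with optional fields.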