Automatic data record detection in web pages

  • Authors:
  • Xiaoying Gao; Le Phong Bao Vuong; Mengjie Zhang

  • Affiliations:
  • School of Mathematics, Statistics and Computer Science, Victoria University of Wellington, Wellington, New Zealand;School of Mathematics, Statistics and Computer Science, Victoria University of Wellington, Wellington, New Zealand;School of Mathematics, Statistics and Computer Science, Victoria University of Wellington, Wellington, New Zealand

  • Venue:
  • KSEM'07: Proceedings of the 2nd International Conference on Knowledge Science, Engineering and Management
  • Year:
  • 2007

Abstract

Wrapper induction is currently the main technology for extracting data from semi-structured web pages. However, it requires training web pages, and the information extraction process is complex, involving pattern induction, data extraction and data transformation. This paper introduces a new approach that achieves automatic data extraction by applying clustering to detect similar text tokens, by labelling text tokens with a new method that captures the hierarchical structure of HTML pages, and by transforming the labelled text tokens to XML with a new algorithm. The approach is evaluated and compared with a number of existing wrapper induction systems on three different sets of web pages. The results suggest that the new approach is effective for data extraction and that it outperforms the existing approaches on these web sites. Because the approach requires no training and has no explicit pattern induction or data extraction process, the whole process is simplified.
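
The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea (group similar text tokens, label them, emit XML) and not the authors' method. It assumes a deliberately crude stand-in for the clustering step: two text tokens count as "similar" when they share the same HTML tag path. All names (`TextTokenCollector`, `cluster_tokens`, `to_xml`, `extract`, the `fieldN` labels, the sample page) are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Tags that never get a closing tag; they must not stay on the path stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TextTokenCollector(HTMLParser):
    """Collect a (tag_path, text) pair for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.tokens.append(("/".join(self.stack), text))


def cluster_tokens(tokens):
    """Assumption: tokens with an identical tag path form one cluster."""
    clusters = defaultdict(list)
    for path, text in tokens:
        clusters[path].append(text)
    return clusters


def to_xml(tokens, min_repeats=2):
    """Label repeating clusters as fields and group them into records."""
    clusters = cluster_tokens(tokens)
    labels = {}
    for path, _ in tokens:
        if len(clusters[path]) >= min_repeats and path not in labels:
            labels[path] = f"field{len(labels) + 1}"

    root = ET.Element("records")
    record, seen = None, set()
    for path, text in tokens:
        label = labels.get(path)
        if label is None:
            continue  # token does not repeat, so treat it as page noise
        if record is None or label in seen:
            # A label showing up again signals the start of a new record.
            record = ET.SubElement(root, "record")
            seen = set()
        ET.SubElement(record, label).text = text
        seen.add(label)
    return root


def extract(html):
    collector = TextTokenCollector()
    collector.feed(html)
    collector.close()
    return to_xml(collector.tokens)


SAMPLE = """
<html><body><h1>Books</h1><ul>
<li><b>Dune</b><i>Frank Herbert</i></li>
<li><b>Emma</b><i>Jane Austen</i></li>
</ul></body></html>
"""

if __name__ == "__main__":
    print(ET.tostring(extract(SAMPLE), encoding="unicode"))
```

On the sample page, the non-repeating heading is dropped and each list item becomes one `record` element with a `field1` and a `field2` child. No training pages are used, which is the property the abstract emphasises; a real system would need a far more robust similarity measure and a labelling scheme that handles nested records.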