Using clustering for web information extraction

  • Authors:
  • Le Phong Bao Vuong;Xiaoying Gao

  • Affiliations:
  • School of Mathematics, Statistics and Computer Science, Victoria University of Wellington, Wellington, New Zealand;School of Mathematics, Statistics and Computer Science, Victoria University of Wellington, Wellington, New Zealand

  • Venue:
  • AI'07 Proceedings of the 20th Australian Joint Conference on Advances in Artificial Intelligence
  • Year:
  • 2007

Abstract

This paper introduces an approach that automates data extraction from semi-structured Web pages by clustering. Both the HTML tags and the textual features of text tokens are considered when comparing similarity. The first clustering process groups similar text tokens into text clusters, and the second groups similar data tuples into tuple clusters. A tuple cluster is a strong candidate for a repetitive data region.
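
The two-stage idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not specify the similarity measures, the clustering method, or how candidate tuples are formed, so everything here is an assumption made for illustration. In particular, `text_shape`, `token_similarity`, `greedy_cluster`, `extract_candidate_region`, the 0.8 thresholds, and the equal weighting of tag and text features are hypothetical, and the page is assumed to be already split into candidate tuples of `(tag_path, text)` tokens.

```python
from collections import Counter
from difflib import SequenceMatcher


def text_shape(text):
    """Collapse text to a coarse shape: letter runs -> 'a', digit runs -> '9'."""
    out = []
    for ch in text:
        code = "9" if ch.isdigit() else "a" if ch.isalpha() else ch
        if not out or out[-1] != code:  # merge runs of the same code
            out.append(code)
    return "".join(out)


def token_similarity(t1, t2):
    """Hypothetical measure: equal-weight mix of tag match and shape similarity."""
    tag_sim = 1.0 if t1[0] == t2[0] else 0.0
    shape_sim = SequenceMatcher(None, text_shape(t1[1]), text_shape(t2[1])).ratio()
    return 0.5 * tag_sim + 0.5 * shape_sim


def greedy_cluster(items, similarity, threshold):
    """Single-pass clustering: each item joins the first cluster whose
    representative (its first member) is similar enough, else starts a new one."""
    representatives, labels = [], []
    for item in items:
        for idx, rep in enumerate(representatives):
            if similarity(item, rep) >= threshold:
                labels.append(idx)
                break
        else:
            representatives.append(item)
            labels.append(len(representatives) - 1)
    return labels


def extract_candidate_region(tuples, token_threshold=0.8, tuple_threshold=0.8):
    """Return indices of the tuples in the largest tuple cluster."""
    # Stage 1: cluster all text tokens using tag and text features.
    flat = [tok for tup in tuples for tok in tup]
    token_labels = greedy_cluster(flat, token_similarity, token_threshold)

    # Re-express each tuple as the sequence of its tokens' cluster labels.
    signatures, pos = [], 0
    for tup in tuples:
        signatures.append(tuple(token_labels[pos:pos + len(tup)]))
        pos += len(tup)

    # Stage 2: cluster tuples by the similarity of their label sequences.
    def tuple_similarity(a, b):
        return SequenceMatcher(None, a, b).ratio()

    tuple_labels = greedy_cluster(signatures, tuple_similarity, tuple_threshold)
    best, _ = Counter(tuple_labels).most_common(1)[0]
    return [i for i, label in enumerate(tuple_labels) if label == best]


# Toy page: a heading, three product rows, and a footer (made-up data).
tuples = [
    [("h1", "Book shop")],
    [("td/b", "Dune"), ("td", "$9.99")],
    [("td/b", "Emma"), ("td", "$12.50")],
    [("td/b", "Ulysses"), ("td", "$7.25")],
    [("p", "Contact us")],
]
print(extract_candidate_region(tuples))  # → [1, 2, 3]
```

In this toy input the three product rows share the same sequence of token clusters, so they fall into one tuple cluster, which the sketch reports as the candidate repetitive data region. A real system would need a principled way to segment tuples and to choose thresholds.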