An unsupervised technique to extract information from semi-structured web pages

  • Authors:
  • Hassan A. Sleiman; Rafael Corchuelo

  • Affiliations:
  • University of Sevilla, Spain; University of Sevilla, Spain

  • Venue:
  • WISE'12: Proceedings of the 13th International Conference on Web Information Systems Engineering
  • Year:
  • 2012

Abstract

We propose a technique that takes two or more web pages generated by the same server-side template and tries to learn a regular expression that represents that template and can be used to extract relevant information from similar pages. Our experimental results on real-world web sites demonstrate that our technique outperforms others in terms of both effectiveness and efficiency, and that it is not affected by HTML errors.
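
To make the general idea in the abstract concrete, here is a minimal sketch in Python. It is not the authors' algorithm, which the abstract does not specify; it swaps in a simple token alignment from the standard library's `difflib.SequenceMatcher`. Tokens shared by two sample pages are treated as template text, and each region where the pages differ becomes a capture group. The names `tokenize` and `learn_pattern` and the sample pages are hypothetical, and the sketch makes no attempt to cope with malformed HTML.

```python
import re
from difflib import SequenceMatcher

# Assumption: a token is either one HTML tag or one run of text between tags.
TOKEN = re.compile(r"<[^>]+>|[^<]+")


def tokenize(page):
    """Split a page into a list of tag tokens and text tokens."""
    return TOKEN.findall(page)


def learn_pattern(page_a, page_b):
    """Build a regex from two pages assumed to share one template.

    Aligned (equal) token runs are kept as literal text; every run where
    the pages differ is replaced by a lazy capture group.
    """
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    parts = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            parts.append(re.escape("".join(a[i1:i2])))
        else:
            # 'replace', 'insert' or 'delete': treat the region as data.
            parts.append("(.*?)")
    return re.compile("".join(parts), re.DOTALL)


# Hypothetical pages produced by the same template.
page_a = "<html><body><h1>Dune</h1><span>Frank Herbert</span></body></html>"
page_b = "<html><body><h1>Emma</h1><span>Jane Austen</span></body></html>"
page_c = "<html><body><h1>Ulysses</h1><span>James Joyce</span></body></html>"

pattern = learn_pattern(page_a, page_b)
match = pattern.fullmatch(page_c)
print(match.groups())  # ('Ulysses', 'James Joyce')
```

The sketch uses `fullmatch` so that the lazy groups must stretch far enough to account for the whole page. Generalizing from two pages to more, and handling optional or repeated template blocks, is beyond what this sketch covers.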