Facilitating wrapper generation with page analysis

  • Authors:
  • Bo Wu; Xueqi Cheng; Yu Wang; Gang Zhang; Guodong Ding

  • Affiliations:
  • Institute of Computing Technology, Chinese Academy of Sciences, Beijing, P. R. China (all authors)

  • Venue:
  • ISI'09: Proceedings of the 2009 IEEE International Conference on Intelligence and Security Informatics
  • Year:
  • 2009

Abstract

Current approaches to generating wrappers for web page extraction require a large number of labeled training pages to obtain satisfactory results. Fully automatic methods, on the other hand, extract data of unreliable quality. In this paper, we propose a novel method that facilitates wrapper generation by combining wrapper induction with page analysis. In addition to manually labeled data, we take advantage of a set of unlabeled pages to improve the quality of the induced wrappers. Our experiments demonstrate that our system achieves satisfactory results with fewer manually labeled training pages.
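
The general idea of combining a few labeled pages with unlabeled ones can be illustrated with a toy sketch. This is not the paper's algorithm, which the abstract does not specify. It assumes a simple left/right delimiter wrapper: candidate delimiters are induced from the labeled pages, and the unlabeled pages are then used to discard candidates that are too specific or too ambiguous. All function names and the `max_context` parameter are hypothetical.

```python
import os


def common_suffix(strings):
    """Longest string that every input ends with."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]


def induce_candidates(labeled, max_context=30):
    """Induce candidate (left, right) delimiter pairs from labeled pages.

    `labeled` is a list of (html, value) pairs, where `value` is the
    manually labeled target string. Candidates are ordered from the
    longest (most specific) left delimiter to the shortest.
    """
    lefts, rights = [], []
    for html, value in labeled:
        start = html.index(value)  # assumes the first occurrence is the label
        end = start + len(value)
        lefts.append(html[max(0, start - max_context):start])
        rights.append(html[end:end + max_context])
    left = common_suffix(lefts)
    right = os.path.commonprefix(rights)
    if not right:
        return []
    return [(left[i:], right) for i in range(len(left))]


def extract(html, wrapper):
    """Apply a (left, right) wrapper; return the extracted text or None."""
    left, right = wrapper
    start = html.find(left)
    if start == -1:
        return None
    start += len(left)
    end = html.find(right, start)
    return None if end == -1 else html[start:end]


def coverage(wrapper, unlabeled):
    """Fraction of unlabeled pages where the left delimiter is unambiguous
    (occurs exactly once) and the wrapper extracts something."""
    if not unlabeled:
        return 0.0
    left, _ = wrapper
    hits = sum(
        1 for html in unlabeled
        if html.count(left) == 1 and extract(html, wrapper) is not None
    )
    return hits / len(unlabeled)


def select_wrapper(labeled, unlabeled):
    """Pick the candidate with the best coverage on unlabeled pages,
    preferring longer (more specific) left delimiters on ties."""
    candidates = induce_candidates(labeled)
    if not candidates:
        return None
    return max(candidates, key=lambda w: (coverage(w, unlabeled), len(w[0])))
```

In this sketch the labeled pages alone tend to yield an over-specific delimiter, because they share incidental text; the unlabeled pages supply the evidence needed to generalize it without further manual labeling.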