Robust web data extraction: a novel approach based on minimum cost script edit model

  • Authors:
  • Donglan Liu; Xinjun Wang; Zhongmin Yan; Qiuyan Li

  • Affiliations:
  • School of Computer Science and Technology, Shandong University, Jinan, 1500 Shunhua Road, P.R. China, and Shandong Provincial Key Laboratory of Software Engineering, Jinan, P.R. China; School of Computer Science and Technology, Shandong University, Jinan, 1500 Shunhua Road, P.R. China, and Shandong Provincial Key Laboratory of Software Engineering, Jinan, P.R. China; School of Computer Science and Technology, Shandong University, Jinan, 1500 Shunhua Road, P.R. China, and Shandong Provincial Key Laboratory of Software Engineering, Jinan, P.R. China; Changchun Institute of Engineering Technology, Changchun, P.R. China

  • Venue:
  • WISM'12 Proceedings of the 2012 international conference on Web Information Systems and Mining
  • Year:
  • 2012

Abstract

On script-generated websites, many documents share a common HTML tree structure, which allows wrappers to extract information of interest from deep web pages effectively. Because the tree structure evolves over time, wrappers break frequently and must be re-learned. In this paper, we explore the problem of constructing robust wrappers for deep web information extraction. To keep extraction robust when a web page changes, we propose a minimum cost script edit model based on machine learning techniques. The model considers three edit operations under structural change: inserting nodes, deleting nodes, and substituting node labels. First, we use a machine learning method to estimate, for each HTML label, the change frequency of each of the three edit operations from page changes observed on real web data. Then, we compute the corresponding edit costs for the three operations from these change frequencies and the minimum cost model. Finally, we apply the minimum cost script to select the most suitable data from which to extract the information of interest. Experimental results show that the proposed approach achieves robust web extraction with high accuracy.
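The three steps in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it makes several assumptions the abstract does not state: edit costs are taken as the negative logarithm of each label's change frequency (so frequent changes are cheap), a node is represented only by its root-to-node path of HTML labels instead of a full tree, and unseen labels get a hypothetical default cost. All function names, the `default` value, and the example frequencies are invented for illustration.

```python
import math


def edit_costs(freqs, floor=1e-6):
    """Map per-label change frequencies to costs (assumed: cost = -log p)."""
    return {label: -math.log(max(p, floor)) for label, p in freqs.items()}


def min_script_cost(old_path, new_path, ins_cost, del_cost, sub_cost, default=5.0):
    """Weighted edit distance between two root-to-node label paths.

    A simplification: the paper works on HTML trees, this sketch on label
    sequences. `default` is a hypothetical cost for labels without statistics.
    """
    m, n = len(old_path), len(new_path)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + del_cost.get(old_path[i - 1], default)
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + ins_cost.get(new_path[j - 1], default)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a, b = old_path[i - 1], new_path[j - 1]
            # Matching labels cost nothing; otherwise pay to relabel `a`.
            sub = 0.0 if a == b else sub_cost.get(a, default)
            dp[i][j] = min(
                dp[i - 1][j] + del_cost.get(a, default),  # delete a
                dp[i][j - 1] + ins_cost.get(b, default),  # insert b
                dp[i - 1][j - 1] + sub,                   # keep or substitute
            )
    return dp[m][n]


def best_candidate(old_path, candidates, ins_cost, del_cost, sub_cost):
    """Pick the candidate reachable from the old path by the cheapest script."""
    return min(
        candidates,
        key=lambda c: min_script_cost(old_path, c, ins_cost, del_cost, sub_cost),
    )


# Illustrative (made-up) insertion frequencies: <div> wrappers appear often.
ins = edit_costs({"div": 0.5, "span": 0.1})
old = ["html", "body", "table", "tr", "td"]
candidates = [
    ["html", "body", "div", "table", "tr", "td"],  # one cheap <div> insertion
    ["html", "body", "ul", "li"],                  # several costly edits
]
target = best_candidate(old, candidates, ins, {}, {})
```

Here the first candidate wins because it differs from the old path by a single insertion of a frequently inserted label, which is the intuition behind preferring the minimum cost script when a page's structure changes.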