Robust web data extraction: a novel approach based on minimum cost script edit model

Authors:
Donglan Liu; Xinjun Wang; Zhongmin Yan; Qiuyan Li
Affiliations:
School of Computer Science and Technology, Shandong University, Jinan, 1500 Shunhua Road, P.R. China, and Shandong Provincial Key Laboratory of Software Engineering, Jinan, P.R. China; School of Computer Science and Technology, Shandong University, Jinan, 1500 Shunhua Road, P.R. China, and Shandong Provincial Key Laboratory of Software Engineering, Jinan, P.R. China; School of Computer Science and Technology, Shandong University, Jinan, 1500 Shunhua Road, P.R. China, and Shandong Provincial Key Laboratory of Software Engineering, Jinan, P.R. China; Changchun Institute of Engineering Technology, Changchun, P.R. China
Venue:
WISM'12: Proceedings of the 2012 International Conference on Web Information Systems and Mining
Year:
2012

Abstract

On script-generated websites, many documents share a common HTML tree structure, which allows wrappers to extract information of interest from deep web pages effectively. Because the tree structure evolves over time, wrappers break frequently and must be re-learned. In this paper, we explore the problem of constructing robust wrappers for deep web information extraction. To keep extraction robust when web pages change, we propose a minimum cost script edit model based on machine learning techniques. The method considers three edit operations under structural change: inserting nodes, deleting nodes, and substituting nodes' labels. First, we use a machine learning method to obtain, for each HTML label, the change frequency of each of the three edit operations from webpage changes observed in real web data. Then, we compute the corresponding edit costs for the three operations from these change frequencies and the minimum cost model. Finally, we apply the minimum cost script to choose the most appropriate data and extract the information of interest. Experimental results show that the proposed approach achieves robust web extraction with high accuracy.
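
The three steps in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract gives no cost formula, so the negative-log mapping from change frequency to edit cost is an assumption, as is the simplification of comparing root-to-node label paths (sequences) rather than full HTML trees. All names (`edit_costs`, `min_cost_script`, `default`) and the example frequencies are hypothetical.

```python
import math

def edit_costs(freq):
    """Map per-label change frequencies (0 < f <= 1) to edit costs.

    Assumption: cost = -log(frequency), so frequent changes are cheap.
    """
    return {label: -math.log(f) for label, f in freq.items()}

def min_cost_script(old, new, ins, dele, sub, default=5.0):
    """Return (cost, script) for the cheapest edit of label path old into new.

    ins, dele, sub: per-label costs for insert, delete, substitute.
    default: assumed cost for labels with no observed frequency.
    """
    n, m = len(old), len(new)
    # dp[i][j] = minimum cost of turning old[:i] into new[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + dele.get(old[i - 1], default)
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + ins.get(new[j - 1], default)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = old[i - 1] == new[j - 1]
            dp[i][j] = min(
                dp[i - 1][j] + dele.get(old[i - 1], default),
                dp[i][j - 1] + ins.get(new[j - 1], default),
                dp[i - 1][j - 1] + (0.0 if same else sub.get(old[i - 1], default)),
            )
    # Backtrack from dp[n][m] to recover one minimum-cost script.
    script, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            same = old[i - 1] == new[j - 1]
            step = 0.0 if same else sub.get(old[i - 1], default)
            if math.isclose(dp[i][j], dp[i - 1][j - 1] + step):
                if not same:
                    script.append(("substitute", old[i - 1], new[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and math.isclose(dp[i][j], dp[i - 1][j] + dele.get(old[i - 1], default)):
            script.append(("delete", old[i - 1]))
            i -= 1
        else:
            script.append(("insert", new[j - 1]))
            j -= 1
    script.reverse()
    return dp[n][m], script

# Hypothetical frequencies: a wrapping <div> is inserted in half of observed changes.
ins_cost = edit_costs({"div": 0.5})
old_path = ["html", "body", "div", "table", "tr", "td"]
new_path = ["html", "body", "div", "div", "table", "tr", "td"]
cost, script = min_cost_script(old_path, new_path, ins_cost, {}, {})
print(cost, script)
```

Under these assumptions, a candidate node in the changed page whose path is reachable from the original wrapper path at the lowest total cost would be the one selected for extraction.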