Many documents on script-generated websites share a common HTML tree structure, which allows wrappers to extract information of interest from deep web pages effectively. Because the tree structure evolves over time, wrappers break frequently and must be re-learned. In this paper, we explore the problem of constructing robust wrappers for deep web information extraction. To keep extraction robust when web pages change, we propose a minimum-cost edit script model based on machine learning techniques. The model considers three edit operations under structural change: inserting nodes, deleting nodes, and substituting node labels. First, we learn the change frequency of each of the three edit operations for each HTML label from the page changes observed in real web data. Then, we compute the corresponding edit cost of each operation from these change frequencies and the minimum-cost model. Finally, we apply the minimum-cost edit script to select the most suitable data for extracting the information of interest. Experimental results show that the proposed approach achieves robust web extraction with high accuracy.
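The three steps above can be illustrated with a minimal sketch. It is not the paper's model: it simplifies the HTML tree to a root-to-node path of tag labels, and it assumes that a change frequency p is turned into a cost by -log(p), so that frequent changes are cheap. The function names, the `default` cost for unseen labels, and the example frequencies are all hypothetical.

```python
import math
from typing import Dict, List, Sequence


def costs_from_frequencies(freq: Dict[str, float], eps: float = 1e-6) -> Dict[str, float]:
    """Turn per-label change frequencies into costs with -log(p) (assumed mapping)."""
    return {label: -math.log(min(max(p, eps), 1.0)) for label, p in freq.items()}


def min_cost_edit(old: Sequence[str], new: Sequence[str],
                  ins_cost: Dict[str, float], del_cost: Dict[str, float],
                  sub_cost: Dict[str, float], default: float = 5.0) -> float:
    """Cost of the cheapest insert/delete/substitute script turning old into new."""
    n, m = len(old), len(new)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # delete every label of the old path
        dp[i][0] = dp[i - 1][0] + del_cost.get(old[i - 1], default)
    for j in range(1, m + 1):          # insert every label of the new path
        dp[0][j] = dp[0][j - 1] + ins_cost.get(new[j - 1], default)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = old[i - 1] == new[j - 1]
            sub = 0.0 if same else sub_cost.get(old[i - 1], default)
            dp[i][j] = min(
                dp[i - 1][j] + del_cost.get(old[i - 1], default),  # delete
                dp[i][j - 1] + ins_cost.get(new[j - 1], default),  # insert
                dp[i - 1][j - 1] + sub,                            # keep or relabel
            )
    return dp[n][m]


def best_candidate(old_path: Sequence[str], candidates: List[Sequence[str]],
                   ins_cost: Dict[str, float], del_cost: Dict[str, float],
                   sub_cost: Dict[str, float]) -> Sequence[str]:
    """Pick the candidate path in the new page that is cheapest to reach."""
    return min(candidates,
               key=lambda c: min_cost_edit(old_path, c, ins_cost, del_cost, sub_cost))


# Hypothetical frequencies: wrapping content in an extra <div> is common.
ins = costs_from_frequencies({"div": 0.5, "span": 0.2})
dele = costs_from_frequencies({"div": 0.3})
sub = costs_from_frequencies({"b": 0.4})

old_path = ["html", "body", "div", "table", "tr", "td"]
candidates = [
    ["html", "body", "div", "div", "table", "tr", "td"],  # one <div> inserted
    ["html", "body", "ul", "li"],                         # unrelated structure
]
chosen = best_candidate(old_path, candidates, ins, dele, sub)
```

Here `chosen` is the first candidate, since reaching it needs only one cheap `div` insertion. A faithful implementation would compute the edit script over whole trees rather than paths and would estimate the frequencies from archived page versions.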