On script-generated web sites, many documents share a common HTML tree structure, which allows wrappers to extract the information of interest effectively. However, the scripts, and thus the tree structure, evolve over time, causing wrappers to break repeatedly and making wrapper maintenance costly. In this paper, we explore a novel approach: we use temporal snapshots of web pages to develop a tree-edit model of HTML, and use this model to improve wrapper construction. We view changes to the tree structure as the result of a series of edit operations: deleting nodes, inserting nodes, and substituting the labels of nodes. The tree structures evolve by choosing these edit operations stochastically. Our model is attractive in that the probability that a source tree has evolved into a target tree can be estimated efficiently (in time quadratic in the size of the trees), making it a potentially useful tool for a variety of tree-evolution problems. We give an algorithm to learn the probabilistic model from training examples consisting of pairs of trees, and apply this algorithm to collections of web-page snapshots to derive HTML-specific tree-edit models. Finally, we describe a novel wrapper-construction framework that takes the tree-edit model into account, and compare the quality of the resulting wrappers to that of traditional wrappers on synthetic and real HTML document examples.
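
To make the idea of a tree that evolves through stochastic edit operations concrete, the following is a minimal illustrative sketch in Python. It is not the model from the paper: the `Node` class, the `LABELS` alphabet, the function names, and the fixed per-node probabilities `p_del`, `p_ins`, and `p_sub` are all assumptions introduced here. The sketch only simulates the forward process (delete a node, insert a node, substitute a label); it does not compute the probability that one tree evolved into another, nor does it learn the parameters from snapshot pairs.

```python
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A labeled, ordered tree node (hypothetical stand-in for an HTML element)."""
    label: str
    children: List["Node"] = field(default_factory=list)


# Hypothetical label alphabet used for insertions and substitutions.
LABELS = ["div", "span", "table", "tr", "td", "a"]


def evolve(node: Node, rng: random.Random,
           p_del: float, p_ins: float, p_sub: float) -> List[Node]:
    """Return the list of nodes that replaces `node` after one random edit pass."""
    # Evolve the children first; each child may become zero or more nodes.
    new_children: List[Node] = []
    for child in node.children:
        new_children.extend(evolve(child, rng, p_del, p_ins, p_sub))

    # Delete: the node disappears and its children are spliced into its parent.
    if rng.random() < p_del:
        return new_children

    # Substitute: replace the label with one drawn from the alphabet.
    label = rng.choice(LABELS) if rng.random() < p_sub else node.label
    result = Node(label, new_children)

    # Insert: wrap the node in a freshly created parent.
    if rng.random() < p_ins:
        result = Node(rng.choice(LABELS), [result])
    return [result]


def evolve_tree(root: Node, rng: random.Random,
                p_del: float = 0.1, p_ins: float = 0.05,
                p_sub: float = 0.05) -> Node:
    """Evolve every node below the root; the root itself is kept fixed."""
    children: List[Node] = []
    for child in root.children:
        children.extend(evolve(child, rng, p_del, p_ins, p_sub))
    return Node(root.label, children)


def size(node: Node) -> int:
    """Number of nodes in the tree rooted at `node`."""
    return 1 + sum(size(c) for c in node.children)


if __name__ == "__main__":
    page = Node("html", [Node("body", [Node("div", [Node("a")]), Node("span")])])
    snapshot = evolve_tree(page, random.Random(0))
    print(size(page), size(snapshot))
```

Keeping the root fixed is a simplification chosen so that the result is always a single tree rather than a forest. Calling `evolve_tree` repeatedly on its own output would simulate a sequence of temporal snapshots of the same page.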