Robust web extraction: an approach based on a probabilistic tree-edit model

Authors:
Nilesh Dalvi; Philip Bohannon; Fei Sha
Affiliations:
Yahoo! Research, Santa Clara, CA, USA; Yahoo! Research, Santa Clara, CA, USA; University of Southern California, Los Angeles, CA, USA
Venue:
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Year:
2009

Abstract

On script-generated web sites, many documents share a common HTML tree structure, which allows wrappers to extract the information of interest effectively. The scripts, and thus the tree structure, evolve over time, so wrappers break repeatedly and are costly to maintain. In this paper, we explore a novel approach: we use temporal snapshots of web pages to develop a tree-edit model of HTML, and use this model to improve wrapper construction. We view changes to the tree structure as the result of a series of edit operations: deleting nodes, inserting nodes, and substituting labels of nodes. The tree structures evolve by choosing these edit operations stochastically. Our model is attractive in that the probability that a source tree has evolved into a target tree can be estimated efficiently, in time quadratic in the size of the trees, making it a potentially useful tool for a variety of tree-evolution problems. We give an algorithm to learn the probabilistic model from training examples consisting of pairs of trees, and apply this algorithm to collections of web-page snapshots to derive HTML-specific tree-edit models. Finally, we describe a novel wrapper-construction framework that takes the tree-edit model into account, and compare the quality of the resulting wrappers to that of traditional wrappers on synthetic and real HTML document examples.
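
To make the idea of a stochastic tree-edit process concrete, the following is a minimal Python sketch. It is not the paper's algorithm: the abstract describes a quadratic-time probability computation and a learned model, whereas this sketch uses plain Monte Carlo sampling with hand-picked edit probabilities. The names `Node`, `evolve_tree`, `select_path`, `select_descendants`, and `robustness`, the label set, and the probabilities are all assumptions introduced for illustration. It shows how one might estimate how often a wrapper still extracts its target node after a page has been randomly edited.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical label alphabet used for inserted or substituted nodes.
LABELS = ["div", "span", "p", "table", "td"]


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    is_target: bool = False  # marks the node the wrapper is meant to extract


def _evolve(node: Node, p_del: float, p_ins: float, p_sub: float,
            labels: List[str], rng: random.Random) -> List[Node]:
    """Return the list of nodes that replace `node` after one edit pass."""
    new_children: List[Node] = []
    for child in node.children:
        new_children.extend(_evolve(child, p_del, p_ins, p_sub, labels, rng))
    if rng.random() < p_del:
        # Deletion: the node vanishes and its children move up to its parent.
        return new_children
    label = rng.choice(labels) if rng.random() < p_sub else node.label
    result = Node(label, new_children, node.is_target)
    if rng.random() < p_ins:
        # Insertion: a fresh node is placed directly above this one.
        result = Node(rng.choice(labels), [result])
    return [result]


def evolve_tree(root: Node, p_del: float, p_ins: float, p_sub: float,
                labels: List[str], rng: random.Random) -> Node:
    """Sample an edited copy of the tree; the root itself is kept fixed."""
    children: List[Node] = []
    for child in root.children:
        children.extend(_evolve(child, p_del, p_ins, p_sub, labels, rng))
    return Node(root.label, children, root.is_target)


def select_path(root: Node, path: List[str]) -> List[Node]:
    """Absolute wrapper: follow child labels from the root, step by step."""
    if not path or root.label != path[0]:
        return []
    frontier = [root]
    for label in path[1:]:
        frontier = [c for n in frontier for c in n.children if c.label == label]
    return frontier


def select_descendants(root: Node, label: str) -> List[Node]:
    """Looser wrapper: every node with the given label, at any depth."""
    found = [root] if root.label == label else []
    for child in root.children:
        found.extend(select_descendants(child, label))
    return found


def robustness(root: Node, wrapper: Callable[[Node], List[Node]],
               n_samples: int = 1000, p_del: float = 0.05,
               p_ins: float = 0.05, p_sub: float = 0.05,
               labels: List[str] = LABELS, seed: int = 0) -> float:
    """Fraction of sampled edits after which the wrapper returns only the target."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        found = wrapper(evolve_tree(root, p_del, p_ins, p_sub, labels, rng))
        if len(found) == 1 and found[0].is_target:
            hits += 1
    return hits / n_samples


# A toy page: html > body > [div > span(target), p]
page = Node("html", [Node("body", [
    Node("div", [Node("span", is_target=True)]),
    Node("p"),
])])

abs_wrapper = lambda t: select_path(t, ["html", "body", "div", "span"])
desc_wrapper = lambda t: select_descendants(t, "span")

print("absolute path  :", robustness(page, abs_wrapper))
print("descendant span:", robustness(page, desc_wrapper))
```

Comparing the two estimates gives a rough sense of which wrapper is more likely to survive page changes under the assumed edit probabilities. In the setting the abstract describes, those probabilities would instead be learned from pairs of page snapshots.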