Computational aspects of resilient data extraction from semistructured sources (extended abstract)

Authors:
Hasan Davulcu;Guizhen Yang;Michael Kifer;I. V. Ramakrishnan
Affiliations:
Department of Computer Science, SUNY at Stony Brook, Stony Brook, NY (all four authors)
Venue:
PODS '00 Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Year:
2000

Abstract

Automatic data extraction from semistructured sources such as HTML pages is rapidly growing into a problem of significant importance, spurred by the growing popularity of the so-called “shopbots” that enable end users to compare prices of goods and other services at various web sites without having to manually browse and fill out forms at each one of these sites.

The main problem one has to contend with when designing data extraction techniques is that the content of a web page changes frequently, either because its data is generated dynamically, in response to filling out a form, or because of changes to its presentation format. This makes the problem of data extraction particularly challenging, since a desirable requirement of any data extraction technique is that it be “resilient”, i.e., using it we should always be able to locate the object of interest in a page (such as a form or an element in a table generated by a form fill-out) in spite of changes to the page's content and layout.

In this paper we propose a formal computation model for developing resilient data extraction techniques from semistructured sources. Specifically, we formalize the problem of data extraction as one of generating unambiguous extraction expressions, which are regular expressions with some additional structure. The problem of resilience is then formalized as one of generating a maximal extraction expression of this kind. We present characterization theorems for maximal extraction expressions, complexity results for testing them, and algorithms for synthesizing them.
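The notion of an unambiguous extraction expression can be illustrated with a small sketch. The abstract does not spell out the formal definition, so the sketch assumes one plausible reading: an expression is a triple (prefix regular expression, marker symbol, suffix regular expression), and it locates the marker in a page when the text before the marker matches the prefix and the text after it matches the suffix. The function name `parses`, the one-character page encoding (`h` heading, `p` paragraph, `f` form, `t` table), and the three sample expressions are all hypothetical and are not taken from the paper. The check below looks at a single page only, whereas unambiguity as a property of an expression would have to hold for every possible page.

```python
import re


def parses(prefix_re, marker, suffix_re, page):
    """Return every index i in `page` where page[i] == marker, the text before
    i fully matches prefix_re, and the text after i fully matches suffix_re.

    Assumption: a page is a string of one-character symbols, and an
    extraction expression is the triple (prefix_re, marker, suffix_re).
    """
    hits = []
    for i, symbol in enumerate(page):
        if symbol != marker:
            continue
        before, after = page[:i], page[i + 1:]
        if re.fullmatch(prefix_re, before) and re.fullmatch(suffix_re, after):
            hits.append(i)
    return hits


# Hypothetical page encodings: the original layout and a changed layout.
original_page = "hpft"      # heading, paragraph, form, table
changed_page = "hppftf"     # an extra paragraph and a second form were added

rigid = ("hp", "f", "t")          # tied to the exact original layout
loose = (".*", "f", ".*")         # any form: more than one parse is possible
first_form = ("[hpt]*", "f", ".*")  # the first form on the page

for name, expr in [("rigid", rigid), ("loose", loose), ("first_form", first_form)]:
    print(name, parses(*expr, original_page), parses(*expr, changed_page))
```

On these two sample pages, the rigid expression finds the form only in the original layout, the loose expression yields two candidate positions on the changed page, and the first-form expression yields exactly one position on each page. This is the kind of trade-off the abstract describes: an expression should accept as many page variations as possible while still pinpointing a single object.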