Automatic data extraction from semistructured sources such as HTML pages is rapidly growing into a problem of significant importance, spurred by the growing popularity of so-called “shopbots” that enable end users to compare prices of goods and other services at various web sites without having to manually browse and fill out forms at each of these sites.

The main problem one has to contend with when designing data extraction techniques is that the contents of a web page change frequently, either because its data is generated dynamically in response to filling out a form, or because of changes to its presentation format. This makes data extraction particularly challenging, since a desirable requirement of any data extraction technique is that it be “resilient”, i.e., using it we should always be able to locate the object of interest in a page (such as a form or an element in a table generated by a form fill-out) in spite of changes to the page's content and layout.

In this paper we propose a formal computation model for developing resilient data extraction techniques from semistructured sources. Specifically, we formalize the problem of data extraction as one of generating unambiguous extraction expressions, which are regular expressions with some additional structure. The problem of resilience is then formalized as one of generating a maximal extraction expression of this kind. We present characterization theorems for maximal extraction expressions, complexity results for testing them, and algorithms for synthesizing them.
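
To make the notions of unambiguity and resilience concrete, the following is a minimal illustrative sketch, not the construction from the paper. It assumes, purely for illustration, that a page is a string of single-character tokens (`h` heading, `p` paragraph, `f` form, `t` table) and that an extraction expression has the shape prefix, marked token, suffix, where prefix and suffix are ordinary regular expressions. The abstract says only that extraction expressions are regular expressions with some additional structure, so this shape, the token encoding, and the helper name `split_points` are all assumptions. The check below tests ambiguity on one given page only, not over all possible pages.

```python
import re


def split_points(prefix, marker, suffix, page):
    """Return every index i where page[i] is the marked token, the tokens
    before i match `prefix`, and the tokens after i match `suffix`.

    Hypothetical helper: a page is a string of one-character tokens, and
    `prefix` / `suffix` are Python regular expressions over those tokens.
    """
    hits = []
    for i, token in enumerate(page):
        if (
            token == marker
            and re.fullmatch(prefix, page[:i])
            and re.fullmatch(suffix, page[i + 1:])
        ):
            hits.append(i)
    return hits


original_page = "hpfpf"   # heading, paragraph, form, paragraph, form
changed_page = "hppfpf"   # same page after an extra paragraph is added

# Ambiguous on this page: either form can play the role of the marked token.
ambiguous = split_points(".*", "f", ".*", original_page)

# Unambiguous on this page: the prefix forbids forms, so only the first
# form can be the marked token.
first_form = split_points("[^f]*", "f", ".*", original_page)

# Brittle: the prefix hard-codes the exact layout, so a layout change
# makes the expression miss the form entirely.
brittle_after_change = split_points("hp", "f", ".*", changed_page)

# More general prefix: still locates the first form after the change.
general_after_change = split_points("[^f]*", "f", ".*", changed_page)

print(ambiguous, first_form, brittle_after_change, general_after_change)
```

In this toy setting, the expression with prefix `[^f]*` accepts strictly more pages than the one with prefix `hp` while still admitting at most one split point per page, which is the intuition behind preferring more general unambiguous expressions.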