On the complexity of schema inference from web pages in the presence of nullable data attributes

Authors:
Guizhen Yang;I. V. Ramakrishnan;Michael Kifer
Affiliations:
University at Buffalo, Buffalo, NY;Stony Brook University, Stony Brook, NY;Stony Brook University, Stony Brook, NY
Venue:
CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Year:
2003

Abstract

An increasingly large number of Web pages are machine-generated by filling in templates with data stored in backend databases. These templates can be viewed as the implicit schemas of those Web pages. The ability to infer the implicit schema from a collection of Web pages is important for scalable data extraction, since the inferred schema can be used to automatically identify schema attributes that are "encoded" in Web pages.

However, the task of inferring a "good" schema is complicated by the existence of nullable (missing) data attributes. Usually, if an attribute contains a null value, it is omitted from the generated Web page, giving rise to different variations and permutations of layout structures in Web pages that are generated from the same template.

In this paper we investigate the complexity of schema inference from Web pages in the presence of nullable data attributes. We introduce the notion of unambiguity as a quality measure for inferred schemas and prove that the problem of inferring "good" (unambiguous) schemas is NP-complete. Our complexity results imply that ambiguity resolution is one of the root causes of the computational difficulty underlying schema inference from Web pages.
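
To make the effect of nullable attributes concrete, the following is a minimal sketch and not the paper's formal model. It assumes a hypothetical template given as an ordered list of (tag, attribute) slots, renders a page by omitting slots whose value is null, and then enumerates every order-preserving way to map the page's tags back onto template slots. The names `TEMPLATE`, `render`, and `alignments` are illustrative assumptions, and "more than one alignment" is used here only as an informal stand-in for ambiguity.

```python
from itertools import combinations

# Hypothetical template: ordered (HTML tag, attribute name) slots.
# "author" and "editor" deliberately share the same tag.
TEMPLATE = [("b", "title"), ("i", "author"), ("i", "editor"), ("span", "price")]


def render(record):
    """Return the page's tag sequence, omitting slots whose value is None."""
    return [tag for tag, attr in TEMPLATE if record.get(attr) is not None]


def alignments(page_tags):
    """List every order-preserving mapping of page tags onto template slots."""
    slots = range(len(TEMPLATE))
    return [
        [TEMPLATE[i][1] for i in chosen]
        for chosen in combinations(slots, len(page_tags))
        if all(TEMPLATE[i][0] == tag for i, tag in zip(chosen, page_tags))
    ]


# A record with every attribute present has exactly one reading.
full = render({"title": "T", "author": "A", "editor": "E", "price": 9})

# Dropping the nullable "editor" leaves a page with two possible readings:
# the single <i> element could be the author or the editor.
partial = render({"title": "T", "author": "A", "editor": None, "price": 9})

for page in (full, partial):
    print(page, "->", alignments(page))
```

In this toy setting the page with a missing attribute admits two attribute assignments, which illustrates why omitted values make it hard to decide which schema attribute a page element encodes. The brute-force enumeration is exponential in the template size and is meant only as an illustration.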