Learning regular sets from queries and counterexamples
Information and Computation
On the complexity of learning strings and sequences
Theoretical Computer Science
On finding minimal, maximal, and consistent sequences over a binary alphabet
Theoretical Computer Science
Learning to Understand Information on the Internet: AnExample-Based Approach
Journal of Intelligent Information Systems - Special issue: next generation information technologies and systems
Wrapper generation for semi-structured Internet sources
ACM SIGMOD Record
A hierarchical approach to wrapper induction
Proceedings of the third annual conference on Autonomous Agents
Generating finite-state transducers for semi-structured data extraction from the Web
Information Systems - Special issue on semistructured data
The Complexity of Some Problems on Subsequences and Supersequences
Journal of the ACM (JACM)
Computational aspects of resilient data extraction from semistructured sources (extended abstract)
PODS '00 Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
XTRACT: a system for extracting document type descriptors from XML documents
SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
A flexible learning system for wrapping tables and lists in HTML documents
Proceedings of the 11th international conference on World Wide Web
Machine Learning
VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases
RoadRunner: Towards Automatic Data Extraction from Large Web Sites
Proceedings of the 27th International Conference on Very Large Data Bases
Extracting structured data from Web pages
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Wrapper induction for information extraction
Wrapper induction for information extraction
On Precision and Recall of Multi-Attribute Data Extraction from Semistructured Sources
ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining
On Precision and Recall of Multi-Attribute Data Extraction from Semistructured Sources
ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining
Automatic web news extraction using tree edit distance
Proceedings of the 13th international conference on World Wide Web
OLERA: Semisupervised Web-Data Extraction with Visual Support
IEEE Intelligent Systems
A Survey of Web Information Extraction Systems
IEEE Transactions on Knowledge and Data Engineering
Mining templates from search result records of search engines
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Bridging the gap: from multi document Template Detection to single document Content Extraction
EuroIMSA '08 Proceedings of the IASTED International Conference on Internet and Multimedia Systems and Applications
Clustering template based web documents
ECIR'08 Proceedings of the IR research, 30th European conference on Advances in information retrieval
Automatic wrappers for large scale web extraction
Proceedings of the VLDB Endowment
Extract knowledge from semi-structured websites for search task simplification
Proceedings of the 20th ACM international conference on Information and knowledge management
Scalable and noise tolerant web knowledge extraction for search task simplification
Decision Support Systems
Hi-index | 0.01 |
An increasingly large number of Web pages are machine-generated by filling in templates with data stored in backend databases. These templates can be viewed as the implicit schemas of those Web pages. The ability to infer the implicit schema from a collection of Web pages is important for scalable data extraction, since the inferred schema can be used to automatically identify schema attributes that are "encoded" in Web pages.However, the task of inferring a "good" schema is complicated due to the existence of nullable (missing) data attributes. Usually if an attribute contains a null value, then it will be omitted in the generated Web page, giving rise to different variations and permutations of layout structures in Web pages that are generated from the same template.In this paper we investigate the complexity of schema inference from Web pages in the presence of nullable data attributes. We introduce the notion of unambiguity as a quality measure for inferred schemas and prove that the problem of inferring "good" (unambiguous) schemas is NP-complete. Our complexity results imply that ambiguity resolution is one of the root causes of the computational difficulty underlying schema inference from Web pages.