A main challenge in wrapping web data is to make wrappers robust w.r.t. variations in HTML sources while reducing human effort as much as possible. In this paper we develop a new approach that speeds up the specification of robust wrappers and frees the wrapper designer from defining extraction rules in detail. The key idea is to enable a schema-based wrapping system to automatically generalize an original wrapper w.r.t. a set of example HTML documents. To this end, we exploit the notions of extraction rule and wrapper subsumption to compute a most general wrapper that still shares the extraction schema with the original wrapper while maximizing the generalization of its extraction rules w.r.t. the set of example documents.