Learning robust web wrappers

Authors:
B. Fazzinga;S. Flesca;A. Tagarelli
Affiliations:
DEIS, University of Calabria;DEIS, University of Calabria;DEIS, University of Calabria
Venue:
DEXA'05 Proceedings of the 16th international conference on Database and Expert Systems Applications
Year:
2005

Abstract

A main challenge in wrapping web data is to make wrappers robust w.r.t. variations in HTML sources while reducing human effort as much as possible. In this paper we develop a new approach to speed up the specification of robust wrappers, freeing the wrapper designer from the detailed definition of extraction rules. The key idea is to enable a schema-based wrapping system to automatically generalize an original wrapper w.r.t. a set of example HTML documents. To accomplish this objective, we propose to exploit the notions of extraction rule and wrapper subsumption to compute a most general wrapper that still shares the extraction schema with the original wrapper while maximizing the generalization of extraction rules w.r.t. the set of example documents.
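
To make the idea of generalizing an extraction rule against example documents more concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a hypothetical rule representation (a list of path steps with optional attribute tests), assumes the example pages are already well-formed XML, and replaces the paper's subsumption machinery with a simple greedy search: a relaxation is kept only if the rule still extracts exactly the same values from every example document. All names (`to_path`, `extract`, `relaxations`, `generalize`) are illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule format: a list of steps (axis, tag, attrs),
# where axis is "/" (child) or "//" (descendant).

def to_path(rule):
    """Render a rule as a path in ElementTree's limited XPath syntax."""
    parts = []
    for axis, tag, attrs in rule:
        preds = "".join(f"[@{k}='{v}']" for k, v in sorted(attrs.items()))
        parts.append(axis + tag + preds)
    return "." + "".join(parts)

def extract(rule, doc):
    """Return the text of every element the rule selects in the document."""
    root = ET.fromstring(doc)
    return [elem.text for elem in root.findall(to_path(rule))]

def relaxations(rule):
    """Yield every rule obtained by relaxing exactly one step."""
    for i, (axis, tag, attrs) in enumerate(rule):
        if axis == "/":
            # Relax a child step into a descendant step.
            yield rule[:i] + [("//", tag, attrs)] + rule[i + 1:]
        for key in attrs:
            # Drop one attribute test.
            fewer = {k: v for k, v in attrs.items() if k != key}
            yield rule[:i] + [(axis, tag, fewer)] + rule[i + 1:]

def generalize(rule, docs):
    """Greedily relax the rule while its output on the examples is unchanged."""
    target = [extract(rule, d) for d in docs]
    changed = True
    while changed:
        changed = False
        for candidate in relaxations(rule):
            if [extract(candidate, d) for d in docs] == target:
                rule, changed = candidate, True
                break
    return rule

# Made-up example page and original (overly specific) rule.
examples = [
    "<html><body><table><tr><td class='name'>Pen</td>"
    "<td class='price'>2</td></tr></table></body></html>"
]
original = [("/", "body", {}), ("/", "table", {}),
            ("/", "tr", {}), ("/", "td", {"class": "price"})]
general = generalize(original, examples)

# A layout variant: the table is now wrapped in a <div>.
variant = ("<html><body><div><table><tr><td class='name'>Ink</td>"
           "<td class='price'>5</td></tr></table></div></body></html>")
print(to_path(general))           # .//body//table//tr//td[@class='price']
print(extract(original, variant)) # []
print(extract(general, variant))  # ['5']
```

In this toy setting the attribute test on `class` survives because dropping it would also extract the product name, whereas every child step can be relaxed to a descendant step without changing what is extracted from the example. The relaxed rule therefore keeps working when the page layout gains an extra wrapper element.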