Template-based wrappers in the TSIMMIS system

Authors:
Joachim Hammer;Héctor García-Molina;Svetlozar Nestorov;Ramana Yerneni;Marcus Breunig;Vasilis Vassalos
Affiliations:
Department of Computer Science, Stanford University, Stanford, CA;Department of Computer Science, Stanford University, Stanford, CA;Department of Computer Science, Stanford University, Stanford, CA;Department of Computer Science, Stanford University, Stanford, CA;Department of Computer Science, Stanford University, Stanford, CA;Department of Computer Science, Stanford University, Stanford, CA
Venue:
SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
Year:
1997

Citing 3
Cited 64

A Query Translation Scheme for Rapid Implementation of Wrappers

DOOD '95 Proceedings of the Fourth International Conference on Deductive and Object-Oriented Databases
Object Exchange Across Heterogeneous Information Sources

ICDE '95 Proceedings of the Eleventh International Conference on Data Engineering
Object Fusion in Mediator Systems

VLDB '96 Proceedings of the 22th International Conference on Very Large Data Bases

Managing semantic heterogeneity in databases: a theoretical prospective

PODS '97 Proceedings of the sixteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
Capability based mediation in TSIMMIS

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Modeling Web sources for information integration

AAAI '98/IAAI '98 Proceedings of the fifteenth national/tenth conference on Artificial intelligence/Innovative applications of artificial intelligence
An XML-based wrapper generator for Web information extraction

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Extracting semi-structured data through examples

Proceedings of the eighth international conference on Information and knowledge management
Computational aspects of resilient data extraction from semistructured sources (extended abstract)

PODS '00 Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Information Systems that Really Support Decision-Making

Journal of Intelligent Information Systems - Special issue on methodologies for intelligent information systems
OBSERVER: An Approach for Query Processing in Global Information Systems Based on Interoperation Across Pre-Existing Ontologies

Distributed and Parallel Databases
Answering queries with useful bindings

ACM Transactions on Database Systems (TODS)
A brief survey of web data extraction tools

ACM SIGMOD Record
Merging structured text using temporal knowledge

Data & Knowledge Engineering
Beyond Schema Versioning: A Flexible Model for Spatio-Temporal Schema Selection

Geoinformatica
Logical fusion rules for merging structured news reports

Data & Knowledge Engineering
DEByE - Data extraction by example

Data & Knowledge Engineering
Wrapping web data into XML

ACM SIGMOD Record
Learning to Match the Schemas of Data Sources: A Multistrategy Approach

Machine Learning
Integrating Knowledge on the Web

IEEE Internet Computing
Managing Web-Based Data: Database Models and Transformations

IEEE Internet Computing
Data extraction from the web based on pre-defined schema

Journal of Computer Science and Technology
Information Systems That also Project into the Future

DNIS '02 Proceedings of the Second International Workshop on Databases in Networked Information Systems
Modeling Information Sources for Information Integration

EKAW '99 Proceedings of the 11th European Workshop on Knowledge Acquisition, Modeling and Management
Optimizing Large Join Queries in Mediation Systems

ICDT '99 Proceedings of the 7th International Conference on Database Theory
Toward Learning Based Web Query Processing

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Describing and Using Query Capabilities of Heterogeneous Sources

VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases
Semantic Integration and Querying of Heterogeneous Data Sources Using a Hypergraph Data Model

BNCOD 19 Proceedings of the 19th British National Conference on Databases: Advances in Databases
The Design and Implementation of Modularized Wrappers/ Monitors in a Data Warehouse

DaWaK '99 Proceedings of the First International Conference on Data Warehousing and Knowledge Discovery
Schema Evolution in Heterogeneous Database Architectures, A Schema Transformation Approach

CAiSE '02 Proceedings of the 14th International Conference on Advanced Information Systems Engineering
A Knowledge-Based Information Extraction System for Semi-structured Labeled Documents

IDEAL '02 Proceedings of the Third International Conference on Intelligent Data Engineering and Automated Learning
A Shopping Agent That Automatically Constructs Wrappers for Semi-Structured Online Vendors

IDEAL '00 Proceedings of the Second International Conference on Intelligent Data Engineering and Automated Learning, Data Mining, Financial Engineering, and Intelligent Agents
Implementing Powerful Retrieval Capabilities in a Distributed Environment for Libraries and Archives

ECDL '98 Proceedings of the Second European Conference on Research and Advanced Technology for Digital Libraries
Databases and the World Wide Web

SOFSEM '99 Proceedings of the 26th Conference on Current Trends in Theory and Practice of Informatics on Theory and Practice of Informatics
Wrapper Generation by Using XML-Based Domain Knowledge for Intelligent Information Extraction

PRICAI '02 Proceedings of the 7th Pacific Rim International Conference on Artificial Intelligence: Trends in Artificial Intelligence
An Example-Based Environment for Wrapper Generation

ER '00 Proceedings of the Workshops on Conceptual Modeling Approaches for E-Business and The World Wide Web and Conceptual Modeling: Conceptual Modeling for E-Business and the Web
SPICE: A Flexible Architecture for Integrating Autonomous Databases to Comprise a Distributed Catalogue of Life

DEXA '00 Proceedings of the 11th International Conference on Database and Expert Systems Applications
Semi-automatic wrapper generation and adaption: living with heterogeneity in a market environment

Enterprise information systems IV
A semi-universal e-commerce agent: domain-dependant information gathering

Enterprise information systems IV
On Precision and Recall of Multi-Attribute Data Extraction from Semistructured Sources

ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining
Schema-guided wrapper maintenance for web-data extraction

WIDM '03 Proceedings of the 5th ACM international workshop on Web information and data management
Bootstrapping Semantic Annotation for Content-Rich HTML Documents

ICDE '05 Proceedings of the 21st International Conference on Data Engineering
How to make web sites talk together: web service solution

WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
HW-STALKER: a machine learning-based system for transforming QURE-Pagelets to XML

Data & Knowledge Engineering
Query reformulation for an XML-based data integration system

Proceedings of the 2006 ACM symposium on Applied computing
Fusion rules for merging uncertain information

Information Fusion
A knowledge-based approach to merging information

Knowledge-Based Systems
Ontology-supported FAQ processing and ranking techniques

Journal of Intelligent Information Systems
Data Extraction From Repositories On The Web: A Semi-Automatic Approach

Journal of Integrated Design & Process Science
Challenges, approaches and architecture for distributed process integration in heterogeneous environments

Advanced Engineering Informatics
OntoMiner: automated metadata and instance mining from news websites

International Journal of Web and Grid Services
Automated Semantic Analysis of Schematic Data

World Wide Web
The Harmony Integration Workbench

Journal on Data Semantics XI
Semantic-based Merging of RSS Items

World Wide Web
Automatic generation of wrapper for data extraction from the web

ICWE'03 Proceedings of the 2003 international conference on Web engineering
Flexible reuse of middleware infrastructures in heterogeneous IT environments

OTM'07 Proceedings of the 2007 OTM Confederated international conference on On the move to meaningful internet systems: CoopIS, DOA, ODBASE, GADA, and IS - Volume Part I
Scalable knowledge extraction from legacy sources with SEEK

ISI'03 Proceedings of the 1st NSF/NIJ conference on Intelligence and security informatics
Combining artificial intelligence and databases for data integration

Artificial intelligence today
From SEEKing knowledge to making connections: challenges, approaches and architectures for distributed process integration

EG-ICE'06 Proceedings of the 13th international conference on Intelligent Computing in Engineering and Architecture
Reduce, reuse, recycle: practical approaches to schema integration, evolution and versioning

CoMoGIS'06 Proceedings of the 2006 international conference on Advances in Conceptual Modeling: theory and practice
PIES: a web information extraction system using ontology and tag patterns

WAIM'05 Proceedings of the 6th international conference on Advances in Web-Age Information Management
An algorithm of online goods information extraction with two-stage working pattern

FSKD'05 Proceedings of the Second international conference on Fuzzy Systems and Knowledge Discovery - Volume Part I
An interface agent for wrapper-based information extraction

PRIMA'04 Proceedings of the 7th Pacific Rim international conference on Intelligent Agents and Multi-Agent Systems
The HiLeX system for semantic information extraction

Transactions on Large-Scale Data- and Knowledge-Centered Systems V
Semistructured data: the TSIMMIS experience

ADBIS'97 Proceedings of the First East-European conference on Advances in Databases and Information systems
Automatically extracting user reviews from forum sites

Computers & Mathematics with Applications
Leveraging spatial join for robust tuple extraction from web pages

Information Sciences: an International Journal

Abstract

To access information from a variety of heterogeneous information sources, one has to be able to translate queries and data from one data model into another. This functionality is provided by so-called (source) wrappers [4, 8], which convert queries into one or more commands or queries understandable by the underlying source and transform the native results into a format understood by the application. As part of the TSIMMIS project [1, 6] we have developed hard-coded wrappers for a variety of sources (e.g., Sybase DBMS, WWW pages), including legacy systems (Folio). However, anyone who has built a wrapper can attest that a lot of effort goes into developing and writing one. In situations where it is important or desirable to gain access to new sources quickly, this is a major drawback. Furthermore, we have observed that only a relatively small part of the code deals with the specific access details of the source. The rest of the code is either common among wrappers or implements query and data transformation that could be expressed in a high-level, declarative fashion.

Based on these observations, we have developed a wrapper implementation toolkit [7] for quickly building wrappers. The toolkit contains a library of commonly used functions, such as those for receiving queries from the application and packaging results. It also contains a facility for translating queries into source-specific commands and for translating results into a model useful to the application. The philosophy behind our "template-based" translation methodology is as follows. The wrapper implementor specifies a set of templates (rules), written in a high-level declarative language, that describe the queries accepted by the wrapper as well as the objects that it returns. If an application query matches a template, an implementor-provided action associated with the template is executed to produce the native query for the underlying source.
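The matching step can be illustrated with a minimal sketch. It uses plain Python in place of MSL, and everything in it is an assumption made for illustration: the `TEMPLATES` structure, the `translate` function, the `person` label, and the SQL-like action strings are hypothetical and are not the toolkit's actual syntax or API.

```python
from string import Template

# Hypothetical templates: each pairs a query pattern (a label plus the set of
# attributes that must be bound) with an action string for the native source.
# This is plain Python standing in for MSL, not the toolkit's real syntax.
TEMPLATES = [
    {
        "label": "person",
        "bound": ("name",),
        "action": Template("SELECT * FROM person WHERE name = '$name'"),
    },
    {
        "label": "person",
        "bound": ("dept", "year"),
        "action": Template(
            "SELECT * FROM person WHERE dept = '$dept' AND year = $year"
        ),
    },
]


def translate(label, bindings):
    """Return the native query of the first template that matches exactly."""
    for template in TEMPLATES:
        if template["label"] == label and set(template["bound"]) == set(bindings):
            return template["action"].substitute(bindings)
    return None  # no template covers this query


native = translate("person", {"name": "Joe"})
```

In this sketch a query either matches a template exactly or is rejected; covering more queries means adding more entries to the template list.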
When the source returns the result of the query, the wrapper transforms the answer, which is represented in the data model of the source, into a representation that is used by the application. Using this toolkit, one can quickly design a simple wrapper with a few templates that cover part of the desired functionality, typically the part that is most urgently needed. Templates can then be added gradually as more functionality is required.

Another important use of wrappers is in extending the query capabilities of a source. For instance, some sources may not be capable of answering queries that have multiple predicates. In such cases, it is necessary to pose a native query to the source using only predicates that the source is capable of handling. The rest of the predicates are automatically separated from the user query and form a filter query. When the wrapper receives the results, a postprocessing engine applies the filter query. This engine supports a set of built-in predicates based on the comparison operators =, ≠, <, >, etc. In addition, the engine supports more complex predicates that can be specified as part of the filter query. The postprocessing engine is common to wrappers of all sources and is part of the wrapper toolkit. Note that, because of postprocessing, the wrapper can handle a much larger class of queries than those that exactly match the templates it has been given.

Figure 1 shows an overview of the wrapper architecture as it is currently implemented in our TSIMMIS testbed. Shaded components are provided by the toolkit; the white component is source-specific and must be generated by the implementor.
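The split between a native query and a filter query can be sketched as follows. This is an illustration under stated assumptions, not the toolkit's implementation: predicates are modeled as hypothetical `(attribute, operator, value)` tuples, the set of attributes the source can search on is passed in by hand, and `split_predicates` and `apply_filter` are names invented for this example.

```python
import operator

# Built-in comparison predicates, keyed by the operator symbol.
OPS = {
    "=": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    ">": operator.gt,
}


def split_predicates(predicates, supported):
    """Separate predicates the source can evaluate from those left to the filter."""
    native = [p for p in predicates if p[0] in supported]
    residual = [p for p in predicates if p[0] not in supported]
    return native, residual


def apply_filter(objects, residual):
    """Keep only the objects that satisfy every residual predicate."""
    return [
        obj
        for obj in objects
        if all(OPS[op](obj[attr], value) for attr, op, value in residual)
    ]


# Hypothetical query: the source can only search on 'author'.
query = [("author", "=", "Smith"), ("year", ">", 1990)]
native, residual = split_predicates(query, supported={"author"})

# Stand-in for the objects the source would return for the native query.
source_result = [
    {"author": "Smith", "year": 1988},
    {"author": "Smith", "year": 1995},
]
answer = apply_filter(source_result, residual)
```

Here the source only sees the `author` predicate, and the `year` predicate is evaluated by the wrapper after the results come back, which is how a wrapper can accept queries its source could not answer directly.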
The driver component controls the translation process and invokes the following services: the parser, which parses the templates, the native schema, and the incoming queries into internal data structures; the matcher, which matches a query against the set of templates and creates a filter query for postprocessing if necessary; the native component, which submits the generated action string to the source and extracts the data from the native result using the information given in the source schema; and the engine, which transforms and packages the result and applies a postprocessing filter if one has been created by the matcher.

We now describe the sequence of events that occur at the wrapper during the translation of a query and its result, using an example from our prototype system. The queries are formulated in a rule-based language called MSL, which has been developed as a template specification and query language for the TSIMMIS project. Data is represented using our Object Exchange Model (OEM). We will briefly describe MSL and OEM in the next section. Details on MSL can be found in [5]; a full introduction to OEM is given in [1].
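The order in which the driver invokes its services can be sketched as a small pipeline. This is only a rough illustration: the `Match` class, the `run_wrapper` function, and the toy services passed to it are hypothetical names chosen for this example, and the query here is a plain string rather than MSL.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Match:
    action: str  # native command produced by the matched template
    filter_query: Optional[Callable[[dict], bool]] = None  # postprocessing filter


def run_wrapper(query, parse, match, submit, package):
    """Drive one query through the four services in the order described above."""
    parsed = parse(query)                 # parser: query -> internal structure
    matched = match(parsed)               # matcher: pick a template, build filter
    if matched is None:
        raise ValueError("query matches no template")
    raw = submit(matched.action)          # native component: run and extract
    objects = package(raw)                # engine: transform and package
    if matched.filter_query is not None:  # engine: apply the filter if present
        objects = [o for o in objects if matched.filter_query(o)]
    return objects


# Toy services, all hypothetical, to show the call sequence.
result = run_wrapper(
    "year > 1990",
    parse=lambda q: q.split(),
    match=lambda p: Match("FETCH ALL", lambda o: o["year"] > int(p[2])),
    submit=lambda action: [("Smith", 1988), ("Jones", 1995)],
    package=lambda rows: [{"author": a, "year": y} for a, y in rows],
)
```

Passing the services in as functions mirrors the division of labor in the text: only the function that talks to the source would need to change from one wrapper to the next.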