Automatic generation of data types for classification of deep web sources

Authors:
Anne H. H. Ngu;David Buttler;Terence Critchlow
Affiliations:
Department of Computer Science, Texas State University, San Marcos, TX;Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, CA;Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, CA
Venue:
DILS'05 Proceedings of the Second international conference on Data Integration in the Life Sciences
Year:
2005

Citing 7
Cited 0

Learning Information Extraction Rules for Semi-Structured and Free Text

Machine Learning - Special issue on natural language learning
Generic Schema Matching with Cupid

Proceedings of the 27th International Conference on Very Large Data Bases
Statistical schema matching across web query interfaces

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
A Fully Automated Object Extraction System for the World Wide Web

ICDCS '01 Proceedings of the The 21st International Conference on Distributed Computing Systems
Probe, Cluster, and Discover: Focused Extraction of QA-Pagelets from the Deep Web

ICDE '04 Proceedings of the 20th International Conference on Data Engineering
Automatic Discovery and Inferencing of Complex Bioinformatics Web Interfaces

World Wide Web
Instance-based schema matching for web databases by domain-specific query probing

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30

Quantified Score

Hi-index	0.00

Visualization

Abstract

A Service Class Description (SCD) is an effective meta-data based approach for discovering Deep Web sources whose data exhibit some regular patterns. However, it is tedious and error prone to create an SCD description manually. Moreover, a manually created SCD is not adaptive to the frequent changes of Web sources. It requires its creator to identify all the possible input and output types of a service a priori. In many domains, it is impossible to exhaustively list all the possible input and output data types of a source in advance. In this paper, we describe machine learning approaches for automatic generation of the data types of an SCD. We propose two different approaches for learning data types of a class of Web sources. The Brute-Force Learner is able to generate data types that can achieve high recall, but with low precision. The Clustering-based Learner generates data types that have a high precision rate, but with a lower recall rate. We demonstrate the feasibility of these two learning-based solutions for automatic generation of data types for citation Web sources and presented a quantitative evaluation of these two solutions.