Automatic generation of data types for classification of deep web sources

  • Authors:
  • Anne H. H. Ngu;David Buttler;Terence Critchlow

  • Affiliations:
  • Department of Computer Science, Texas State University, San Marcos, TX;Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, CA;Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, CA

  • Venue:
  • DILS'05 Proceedings of the Second international conference on Data Integration in the Life Sciences
  • Year:
  • 2005

Quantified Score

Hi-index 0.00

Visualization

Abstract

A Service Class Description (SCD) is an effective meta-data based approach for discovering Deep Web sources whose data exhibit some regular patterns. However, it is tedious and error prone to create an SCD description manually. Moreover, a manually created SCD is not adaptive to the frequent changes of Web sources. It requires its creator to identify all the possible input and output types of a service a priori. In many domains, it is impossible to exhaustively list all the possible input and output data types of a source in advance. In this paper, we describe machine learning approaches for automatic generation of the data types of an SCD. We propose two different approaches for learning data types of a class of Web sources. The Brute-Force Learner is able to generate data types that can achieve high recall, but with low precision. The Clustering-based Learner generates data types that have a high precision rate, but with a lower recall rate. We demonstrate the feasibility of these two learning-based solutions for automatic generation of data types for citation Web sources and presented a quantitative evaluation of these two solutions.