Automatic classification of web databases using domain-dictionaries

  • Authors:
  • Heidy M. Marin-Castro;Victor J. Sosa-Sosa;Ivan Lopez-Arevalo;Hugo Jair Escalante-Baldera

  • Affiliations:
  • Center of Research and Advanced Studies of the National Polytechnic Institute, Information Technology Laboratory, Victoria City, Tamaulipas, Mexico;Center of Research and Advanced Studies of the National Polytechnic Institute, Information Technology Laboratory, Victoria City, Tamaulipas, Mexico;Center of Research and Advanced Studies of the National Polytechnic Institute, Information Technology Laboratory, Victoria City, Tamaulipas, Mexico;National Institute for Astrophysics, Optics and Electronics, Tonantzintla, Puebla, Mexico

  • Venue:
  • MLDM'13 Proceedings of the 9th international conference on Machine Learning and Data Mining in Pattern Recognition
  • Year:
  • 2013

Abstract

The identification, classification and integration of databases on the Web (also called web databases) as information sources remains a great challenge because of their constant growth and diversification. Classifying such web databases according to their application domain is an important step towards the integration of deep web sources. Moreover, given the heterogeneity in design and content among web databases, their automatic classification is both a difficult and a highly demanded task, requiring techniques that can group web databases according to the domains they belong to. In this paper we present a strategy for the automatic classification of web databases based on a new supervised approach. The strategy uses the visible information available in a group of domain-specific Web Query Interfaces (WQIs) to construct a dictionary, or lexicon, that better describes a particular domain of interest. The dictionary is enriched with synonyms. In our experiments, the dictionary was built from a set of randomly selected domain-specific WQIs. Automatic WQI classification based on dictionaries generated in this way showed efficient and competitive results compared with related work.
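
The general idea of dictionary-based classification described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the training texts, the synonym table, the term-frequency weighting and the scoring rule are all assumptions made for illustration, and the names `TRAINING_WQIS`, `SYNONYMS`, `build_dictionary` and `classify` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical toy data: visible text from query interfaces, grouped by domain.
TRAINING_WQIS = {
    "flights": [
        "departure city arrival city departure date passengers",
        "from to departure date return date airline",
    ],
    "books": [
        "title author isbn publisher keyword",
        "author title subject isbn format",
    ],
}

# Hypothetical hand-written synonym table standing in for a real thesaurus.
SYNONYMS = {
    "author": ["writer"],
    "departure": ["leaving"],
    "passengers": ["travelers"],
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_dictionary(interfaces, synonyms):
    """Count terms over one domain's interfaces, then add synonyms.

    Assumption: a synonym receives the same weight as its source term.
    """
    counts = Counter()
    for text in interfaces:
        counts.update(tokenize(text))
    for term, alternatives in synonyms.items():
        if term in counts:
            for alt in alternatives:
                counts[alt] = max(counts[alt], counts[term])
    return counts


def classify(text, dictionaries):
    """Return the domain whose dictionary best covers the interface text.

    Assumption: the score is the summed weight of matched tokens,
    normalized by the total weight of the domain dictionary.
    """
    tokens = tokenize(text)
    scores = {
        domain: sum(d[t] for t in tokens) / sum(d.values())
        for domain, d in dictionaries.items()
    }
    return max(scores, key=scores.get)


DICTIONARIES = {
    domain: build_dictionary(texts, SYNONYMS)
    for domain, texts in TRAINING_WQIS.items()
}
```

In this sketch, a call such as `classify("writer title isbn", DICTIONARIES)` matches the books dictionary only because the synonym "writer" was added to it, which is the role the abstract attributes to synonym enrichment.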