Automatic discovery of Web Query Interfaces using machine learning techniques

Authors:
Heidy M. Marin-Castro; Victor J. Sosa-Sosa; Jose F. Martinez-Trinidad; Ivan Lopez-Arevalo
Affiliations:
Center of Research and Advanced Studies of the National Polytechnic Institute, Information Technology Laboratory, Victoria City, Mexico; Center of Research and Advanced Studies of the National Polytechnic Institute, Information Technology Laboratory, Victoria City, Mexico; National Institute for Astrophysics, Optics and Electronics, Tonantzintla, Puebla, Mexico; Center of Research and Advanced Studies of the National Polytechnic Institute, Information Technology Laboratory, Victoria City, Mexico
Venue:
Journal of Intelligent Information Systems
Year:
2013

Abstract

The amount of information contained in databases available on the Web has grown explosively in recent years. This information, known as the Deep Web, is heterogeneous and is generated dynamically by querying back-end (relational) databases through Web Query Interfaces (WQIs), which are a special type of HTML form. Accessing Deep Web information is a major challenge because it is usually not indexed by general-purpose search engines. It is therefore necessary to create efficient mechanisms to access, extract and integrate the information contained in the Deep Web. Since WQIs are the only means of accessing the Deep Web, their automatic identification plays an important role: it helps traditional search engines increase their coverage and reach relevant information that is not available on the indexable Web. The accurate identification of Deep Web data sources is a key issue in the information retrieval process. In this paper we propose a new strategy for the automatic discovery of WQIs. The proposal selects a suitable set of HTML elements extracted from HTML forms, which are used in a set of heuristic rules that help to identify WQIs. The proposed strategy uses machine learning algorithms to classify HTML forms as searchable (WQI) or non-searchable (non-WQI), together with a prototype selection algorithm that removes irrelevant or redundant data from the training set. The internal content of Web Query Interfaces was analyzed in order to identify only those HTML elements that appear frequently and provide relevant information for WQI identification. For testing, we use three datasets: two available at the UIUC repository and a new dataset, created with a generic crawler and supported by human experts, that includes advanced and simple query interfaces. The experimental results show that the proposed strategy outperforms previously reported approaches.
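The idea of extracting HTML elements from forms and passing them to heuristic rules can be illustrated with a minimal sketch. It is not the authors' method: the chosen features (counts of input types, `select` and `textarea` elements, and the form method), the helper names `extract_form_features` and `looks_searchable`, and the rule itself are all assumptions made for illustration. In the approach described in the abstract, feature vectors of this kind would be passed to a trained classifier rather than to a single hand-written rule.

```python
from collections import Counter
from html.parser import HTMLParser


class FormFeatureParser(HTMLParser):
    """Collect simple element counts for every <form> in an HTML page."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one feature dict per completed form
        self._current = None  # feature dict of the form being parsed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "method": (attrs.get("method") or "get").lower(),
                "counts": Counter(),
            }
        elif self._current is not None:
            if tag == "input":
                # An <input> without a type attribute defaults to "text".
                kind = (attrs.get("type") or "text").lower()
                self._current["counts"][kind] += 1
            elif tag in ("select", "textarea"):
                self._current["counts"][tag] += 1

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def extract_form_features(html):
    """Return a list of feature dicts, one per form found in `html`."""
    parser = FormFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.forms


def looks_searchable(features):
    """Hypothetical rule: no password field and at least one query field."""
    counts = features["counts"]
    if counts["password"] > 0:
        return False  # login and registration forms are not query interfaces
    query_fields = counts["text"] + counts["search"] + counts["select"]
    return query_fields > 0


page = """
<form action="/search" method="get">
  <input type="text" name="title">
  <select name="genre"><option>Any</option></select>
  <input type="submit" value="Search">
</form>
<form action="/login" method="post">
  <input type="text" name="user">
  <input type="password" name="pw">
  <input type="submit" value="Log in">
</form>
"""

results = [looks_searchable(f) for f in extract_form_features(page)]
print(results)  # [True, False]
```

A hand-written rule like this one is brittle (it would, for example, accept a newsletter sign-up form with a single text field), which is one motivation for learning the decision from labelled searchable and non-searchable forms instead.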