Combining classifiers to identify online databases

Authors:
Luciano Barbosa;Juliana Freire
Affiliations:
University of Utah, Salt Lake City, UT;University of Utah, Salt Lake City, UT
Venue:
Proceedings of the 16th international conference on World Wide Web
Year:
2007

Abstract

We address the problem of identifying the domain of online databases. More precisely, given a set F of Web forms automatically gathered by a focused crawler and an online database domain D, our goal is to select from F only the forms that are entry points to databases in D. Having a set of Web forms that serve as entry points to similar online databases is a requirement for many applications and techniques that aim to extract and integrate hidden-Web information, such as meta-searchers, online database directories, hidden-Web crawlers, and form-schema matching and merging.

We propose a new strategy that automatically and accurately classifies online databases based on features that can be easily extracted from Web forms. By judiciously partitioning the space of form features, this strategy allows the use of simpler classifiers that can be constructed using learning techniques that are better suited for the features of each partition. Experiments using real Web data in a representative set of domains show that the use of different classifiers leads to high accuracy, precision and recall. This indicates that our modular classifier composition provides an effective and scalable solution for classifying online databases.
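
The idea of partitioning form features and composing simpler classifiers can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the two partitions (structural counts versus form text), the hand-written rule standing in for a learned structural classifier, the small naive Bayes text classifier, and all names (`Form`, `is_searchable`, `DomainTextClassifier`, `select_domain_forms`) are hypothetical.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Form:
    # Hypothetical structural features of a Web form.
    n_text_inputs: int
    n_selects: int
    n_password_fields: int
    # Hypothetical textual features: visible text in and around the form.
    text: str


def is_searchable(form: Form) -> bool:
    """Structural partition: a fixed rule standing in for a learned classifier."""
    if form.n_password_fields > 0:  # assume login forms are not database entry points
        return False
    return form.n_text_inputs + form.n_selects >= 1


class DomainTextClassifier:
    """Text partition: a tiny multinomial naive Bayes (in domain D vs. not)."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def fit(self, texts, labels):
        # Assumes at least one training text per class.
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
            self.doc_counts[label] += 1
        return self

    def predict(self, text: str) -> bool:
        vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in (True, False):
            total_words = sum(self.word_counts[label].values())
            score = math.log(self.doc_counts[label] / total_docs)
            for word in text.lower().split():
                # Laplace smoothing so unseen words do not zero out a class.
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocab)))
            scores[label] = score
        return scores[True] > scores[False]


def select_domain_forms(forms, domain_clf):
    """Compose the two classifiers: keep forms that pass both, in sequence."""
    return [f for f in forms if is_searchable(f) and domain_clf.predict(f.text)]


# Toy training data for a made-up "airfare" domain D.
domain_clf = DomainTextClassifier().fit(
    ["departure city arrival city date passengers", "flight from to depart return",
     "book title author isbn", "username password login"],
    [True, True, False, False],
)

# F: forms as a crawler might have gathered them (invented examples).
forms = [
    Form(2, 1, 0, "departure city return date flight"),
    Form(1, 0, 1, "username password login"),
    Form(1, 1, 0, "book title author"),
]
selected = select_domain_forms(forms, domain_clf)
```

In this sketch each classifier sees only its own partition of the features, so each can stay simple and be trained or replaced independently; the sequential composition filters F down to the forms judged to be entry points to databases in D.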