Google's Deep Web crawl

Authors:
Jayant Madhavan;David Ko;Łucja Kot;Vignesh Ganapathy;Alex Rasmussen;Alon Halevy
Affiliations:
Google Inc.;Google Inc.;Cornell University;Google Inc.;University of California, San Diego;Google Inc.
Venue:
Proceedings of the VLDB Endowment
Year:
2008


Abstract

The Deep Web, i.e., content hidden behind HTML forms, has long been acknowledged as a significant gap in search engine coverage. Since it represents a large portion of the structured data on the Web, accessing Deep-Web content has been a long-standing challenge for the database community. This paper describes a system for surfacing Deep-Web content, i.e., pre-computing submissions for each HTML form and adding the resulting HTML pages to a search engine index. The results of our surfacing have been incorporated into the Google search engine and today drive more than a thousand queries per second to Deep-Web content. Surfacing the Deep Web poses several challenges. First, our goal is to index the content behind many millions of HTML forms that span many languages and hundreds of domains. This necessitates an approach that is completely automatic, highly scalable, and very efficient. Second, a large number of forms have text inputs and require valid input values to be submitted. We present an algorithm for selecting input values for text search inputs that accept keywords and an algorithm for identifying inputs that accept only values of a specific type. Third, HTML forms often have more than one input, and hence a naive strategy of enumerating the entire Cartesian product of all possible inputs can result in a very large number of URLs being generated. We present an algorithm that efficiently navigates the search space of possible input combinations to identify only those that generate URLs suitable for inclusion in our web search index. We present an extensive experimental evaluation validating the effectiveness of our algorithms.
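
The Cartesian-product problem named in the abstract can be illustrated with a small sketch. This is not the paper's algorithm. It only shows how the number of candidate URLs grows when every input is enumerated, and how binding a subset of inputs keeps the count smaller. The form action, input names, and values are hypothetical, as are the helper names `count_submissions` and `surface_urls`.

```python
from itertools import product
from math import prod
from urllib.parse import urlencode

# Hypothetical form: the action URL, input names, and values are assumptions
# made up for illustration; they do not come from the paper.
FORM_ACTION = "http://example.com/search"
FORM_INPUTS = {
    "make": ["honda", "toyota", "ford"],
    "state": ["CA", "NY"],
    "max_price": ["5000", "10000", "20000", "50000"],
}


def count_submissions(inputs):
    """Return the size of the full Cartesian product over all inputs."""
    return prod(len(values) for values in inputs.values())


def surface_urls(action, inputs, bound_names):
    """Yield GET URLs that bind only the inputs in bound_names.

    Inputs outside bound_names are left out of the query string, which
    stands in for submitting the form with those fields at their defaults.
    """
    names = sorted(bound_names)
    for combo in product(*(inputs[name] for name in names)):
        yield f"{action}?{urlencode(list(zip(names, combo)))}"


if __name__ == "__main__":
    print("full product:", count_submissions(FORM_INPUTS))
    for url in surface_urls(FORM_ACTION, FORM_INPUTS, {"make", "state"}):
        print(url)
```

With three inputs of 3, 2, and 4 values, full enumeration yields 24 URLs, while binding only `make` and `state` yields 6. Which subsets of inputs are worth binding is the question the abstract's third algorithm addresses; the sketch leaves that choice to the caller.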