The Deep Web, i.e., content hidden behind HTML forms, has long been acknowledged as a significant gap in search engine coverage. Since it represents a large portion of the structured data on the Web, accessing Deep-Web content has been a long-standing challenge for the database community. This paper describes a system for surfacing Deep-Web content, i.e., pre-computing submissions for each HTML form and adding the resulting HTML pages to a search engine index. The results of our surfacing have been incorporated into the Google search engine and today drive more than a thousand queries per second to Deep-Web content. Surfacing the Deep Web poses several challenges. First, our goal is to index the content behind many millions of HTML forms that span many languages and hundreds of domains. This necessitates an approach that is completely automatic, highly scalable, and very efficient. Second, a large number of forms have text inputs and require valid input values to be submitted. We present an algorithm for selecting input values for text search inputs that accept keywords and an algorithm for identifying inputs that accept only values of a specific type. Third, HTML forms often have more than one input, and hence a naive strategy of enumerating the entire Cartesian product of all possible inputs can result in a very large number of URLs being generated. We present an algorithm that efficiently navigates the search space of possible input combinations to identify only those that generate URLs suitable for inclusion in our web search index. We present an extensive experimental evaluation validating the effectiveness of our algorithms.
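To make the third challenge concrete, the following minimal Python sketch shows the naive strategy the abstract warns against: enumerating every combination of input values for a form and emitting one GET URL per combination. It is not the paper's algorithm. The form action, the input names, and the candidate values are all invented for illustration, and the form is assumed to submit via GET.

```python
from itertools import product
from math import prod
from urllib.parse import urlencode

# Hypothetical form: the action URL, input names, and candidate values
# are assumptions made up for this illustration.
FORM_ACTION = "http://example.com/search"
FORM_INPUTS = {
    "make": ["ford", "honda", "toyota"],
    "state": ["CA", "NY"],
    "max_price": ["5000", "10000", "20000", "40000"],
}


def count_submissions(inputs):
    """Return the size of the full Cartesian product of candidate values."""
    return prod(len(values) for values in inputs.values())


def enumerate_urls(action, inputs):
    """Yield one GET URL per combination of input values (the naive strategy)."""
    names = list(inputs)
    for combo in product(*(inputs[name] for name in names)):
        yield f"{action}?{urlencode(dict(zip(names, combo)))}"


urls = list(enumerate_urls(FORM_ACTION, FORM_INPUTS))
print(count_submissions(FORM_INPUTS))  # 3 * 2 * 4 = 24
print(urls[0])
```

Even this toy form yields 24 URLs, and adding a fourth input with 50 candidate values would raise the count to 1,200. The count is the product of the per-input value counts, which is why exhaustive enumeration does not scale to forms with many inputs.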