An ever-increasing amount of information on the Web today is available only through search interfaces: users must type a set of keywords into a search form in order to access pages from certain Web sites. These pages are often referred to as the Hidden Web or the Deep Web. Since there are no static links to Hidden Web pages, search engines cannot discover and index them and thus do not return them in their results. However, according to recent studies, the content provided by many Hidden Web sites is often of very high quality and can be extremely valuable to many users.

In this paper, we study how to build an effective Hidden Web crawler that can autonomously discover and download pages from the Hidden Web. Since the only "entry point" to a Hidden Web site is a query interface, the main challenge that a Hidden Web crawler faces is how to automatically generate meaningful queries to issue to the site. We provide a theoretical framework for investigating the query generation problem for the Hidden Web, and we propose effective policies for generating queries automatically. Our policies proceed iteratively, issuing a different query in every iteration. We experimentally evaluate the effectiveness of these policies on four real Hidden Web sites, and our results are very promising. For instance, in one experiment, one of our policies downloaded more than 90% of a Hidden Web site (containing 14 million documents) after issuing fewer than 100 queries.
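The iterative idea described above, issuing a different query in every iteration and using what has already been downloaded to choose the next one, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the in-memory `SITE`, the `search` and `crawl` functions, and in particular the selection rule (pick the unissued term that appears in the most downloaded documents), which is a simple greedy heuristic rather than the policies the authors propose.

```python
from collections import Counter

# Hypothetical stand-in for a Hidden Web site: document id -> text.
SITE = {
    1: "deep web crawler query",
    2: "hidden web search form",
    3: "query interface keyword search",
    4: "database keyword index",
    5: "crawler index database",
}


def search(site, keyword):
    """Simulated query interface: ids of documents containing the keyword."""
    return {doc_id for doc_id, text in site.items() if keyword in text.split()}


def crawl(site, seed, max_queries):
    """Issue one new keyword query per iteration, up to max_queries.

    Assumed greedy rule: the next query is the not-yet-issued term that
    occurs in the most downloaded documents (ties broken alphabetically).
    """
    downloaded = {}          # doc id -> text retrieved so far
    issued = []              # queries issued, in order
    doc_freq = Counter()     # term -> number of downloaded docs containing it
    query = seed
    while query is not None and len(issued) < max_queries:
        issued.append(query)
        for doc_id in search(site, query):
            if doc_id not in downloaded:
                downloaded[doc_id] = site[doc_id]
                doc_freq.update(set(site[doc_id].split()))
        # Candidate queries come only from text the crawler has already seen.
        candidates = [(-count, term) for term, count in doc_freq.items()
                      if term not in issued]
        query = min(candidates)[1] if candidates else None
    return downloaded, issued


if __name__ == "__main__":
    docs, queries = crawl(SITE, seed="web", max_queries=10)
    print(f"downloaded {len(docs)} of {len(SITE)} documents")
    print("queries issued:", queries)
```

The crawler never inspects the site directly; it learns candidate keywords only from the documents returned by earlier queries, which mirrors the constraint that the query interface is the sole entry point. The query budget (`max_queries`) stands in for the cost of issuing queries against a real site.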