Downloading textual hidden web content through keyword queries

Authors:
Alexandros Ntoulas; Petros Zerfos; Junghoo Cho
Affiliations:
UCLA; UCLA; UCLA
Venue:
Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries
Year:
2005

Citing 15
Cited 47


Quantified Score

Hi-index	0.02

Abstract

An ever-increasing amount of information on the Web today is available only through search interfaces: the users have to type in a set of keywords in a search form in order to access the pages from certain Web sites. These pages are often referred to as the Hidden Web or the Deep Web. Since there are no static links to the Hidden Web pages, search engines cannot discover and index such pages and thus do not return them in the results. However, according to recent studies, the content provided by many Hidden Web sites is often of very high quality and can be extremely valuable to many users.

In this paper, we study how we can build an effective Hidden Web crawler that can autonomously discover and download pages from the Hidden Web. Since the only "entry point" to a Hidden Web site is a query interface, the main challenge that a Hidden Web crawler has to face is how to automatically generate meaningful queries to issue to the site. Here, we provide a theoretical framework to investigate the query generation problem for the Hidden Web and we propose effective policies for generating queries automatically. Our policies proceed iteratively, issuing a different query in every iteration. We experimentally evaluate the effectiveness of these policies on 4 real Hidden Web sites and our results are very promising. For instance, in one experiment, one of our policies downloaded more than 90% of a Hidden Web site (that contains 14 million documents) after issuing fewer than 100 queries.
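
The iterative idea in the abstract (issue one new keyword query per iteration, then use what was downloaded to pick the next query) can be illustrated with a small sketch. This is not the paper's policy, which the abstract does not spell out. It assumes a naive heuristic: the next keyword is the unused word that occurs in the most documents downloaded so far. The names `search`, `crawl`, and `SITE`, and the toy in-memory "site" itself, are hypothetical stand-ins for a real search form.

```python
from collections import Counter


def search(site, keyword):
    """Stand-in for a keyword search form: ids of documents containing the keyword."""
    return {doc_id for doc_id, text in site.items() if keyword in text.split()}


def crawl(site, seed, max_queries):
    """Issue a different keyword query in every iteration and collect the results."""
    downloaded = {}   # doc_id -> text retrieved so far
    issued = []       # keywords already used, in order
    keyword = seed
    while keyword is not None and len(issued) < max_queries:
        issued.append(keyword)
        for doc_id in search(site, keyword):
            downloaded[doc_id] = site[doc_id]
        # Assumed heuristic: count, for each unused word, how many downloaded
        # documents contain it, and query the most common one next.
        counts = Counter()
        for text in downloaded.values():
            counts.update(set(text.split()) - set(issued))
        # Ties are broken alphabetically so the run is deterministic.
        keyword = min(counts, key=lambda w: (-counts[w], w)) if counts else None
    return downloaded, issued


# Hypothetical five-document site that is reachable only through search().
SITE = {
    1: "web search crawler",
    2: "web search engine",
    3: "search engine index",
    4: "engine index query",
    5: "index query form",
}

downloaded, issued = crawl(SITE, seed="web", max_queries=4)
print(issued, f"coverage: {len(downloaded)}/{len(SITE)}")
```

Picking the most frequent word is only a rough proxy for how many new documents a query would return; a real crawler would also need to account for documents it has already downloaded and for the cost of each query.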