An adaptive crawler for locating hidden-Web entry points

Authors:
Luciano Barbosa;Juliana Freire
Affiliations:
University of Utah, Salt Lake City, UT;University of Utah, Salt Lake City, UT
Venue:
Proceedings of the 16th international conference on World Wide Web
Year:
2007

Abstract

In this paper we describe new adaptive crawling strategies to efficiently locate the entry points to hidden-Web sources. The fact that hidden-Web sources are very sparsely distributed makes the problem of locating them especially challenging. We deal with this problem by using the contents of pages to focus the crawl on a topic; by prioritizing promising links within the topic; and by also following links that may not lead to immediate benefit. We propose a new framework whereby crawlers automatically learn patterns of promising links and adapt their focus as the crawl progresses, thus greatly reducing the amount of required manual setup and tuning. Our experiments over real Web pages in a representative set of domains indicate that online learning leads to significant gains in harvest rates: the adaptive crawlers retrieve up to three times as many forms as crawlers that use a fixed focus strategy.
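
The idea of learning patterns of promising links online can be illustrated with a small sketch. The abstract does not specify the features or the learner, so everything below is an assumption made for illustration: the class name `AdaptiveFrontier`, the helper `tokens`, and the smoothed success-rate score are hypothetical stand-ins, not the authors' method. The sketch keeps a priority queue of links, scores each link by how often its URL and anchor-text tokens previously led to a page with a form, and rescores the queue as feedback arrives.

```python
import heapq
import re
from collections import defaultdict


def tokens(url, anchor):
    """Split a link's URL and anchor text into lowercase word features."""
    text = (url + " " + anchor).lower()
    return [t for t in re.split(r"[^a-z0-9]+", text) if t]


class AdaptiveFrontier:
    """Link queue whose priorities come from an online-updated model.

    Hypothetical illustration only; not the crawler described in the paper.
    """

    def __init__(self):
        # hits[t] / seen[t] estimates how often a link containing token t
        # led to a page with a searchable form.
        self.hits = defaultdict(int)
        self.seen = defaultdict(int)
        self.heap = []
        self.counter = 0  # tie-breaker so heap entries stay comparable

    def score(self, url, anchor):
        feats = tokens(url, anchor)
        if not feats:
            return 0.5
        # Laplace-smoothed success rate, averaged over the link's tokens.
        rates = [(self.hits[t] + 1) / (self.seen[t] + 2) for t in feats]
        return sum(rates) / len(rates)

    def push(self, url, anchor):
        # heapq is a min-heap, so the score is stored negated.
        entry = (-self.score(url, anchor), self.counter, url, anchor)
        heapq.heappush(self.heap, entry)
        self.counter += 1

    def pop(self):
        _, _, url, anchor = heapq.heappop(self.heap)
        return url, anchor

    def feedback(self, url, anchor, found_form):
        """Online update: record whether following the link paid off."""
        for t in set(tokens(url, anchor)):
            self.seen[t] += 1
            if found_form:
                self.hits[t] += 1

    def rerank(self):
        """Rescore queued links after the model has changed."""
        self.heap = [(-self.score(u, a), c, u, a) for _, c, u, a in self.heap]
        heapq.heapify(self.heap)
```

In a crawl loop, one would call `pop`, fetch the page, check it for a form, call `feedback` with the outcome, `push` the page's outgoing links, and call `rerank` periodically. A fixed-focus strategy corresponds to never calling `feedback`, in which case every link keeps the same neutral score.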