On the design of a learning crawler for topical resource discovery

Authors:
Charu C. Aggarwal;Fatima Al-Garawi;Philip S. Yu
Affiliations:
IBM T. J. Watson Research Center, Yorktown Heights, NY;Columbia University, New York, NY;IBM T. J. Watson Research Center, Yorktown Heights, NY
Venue:
ACM Transactions on Information Systems (TOIS)
Year:
2001

Citing 14
Cited 16

Improved algorithms for topic distillation in a hyperlinked environment

Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Automatic resource compilation by analyzing hyperlink structure and associated text

WWW7 Proceedings of the seventh international conference on World Wide Web 7
Efficient crawling through URL ordering

WWW7 Proceedings of the seventh international conference on World Wide Web 7
On the merits of building categorization systems by supervised clustering

KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
Focused crawling: a new approach to topic-specific Web resource discovery

WWW '99 Proceedings of the eighth international conference on World Wide Web
Authoritative sources in a hyperlinked environment

Proceedings of the ninth annual ACM-SIAM symposium on Discrete algorithms
WTMS: a system for collecting and analyzing topic-specific Web information

Proceedings of the 9th international World Wide Web conference on Computer networks: the international journal of computer and telecommunications networking
Intelligent crawling on the World Wide Web with arbitrary predicates

Proceedings of the 10th international conference on World Wide Web
Mining the Web's Link Structure

Computer
Distributed Hypertext Resource Discovery Through Examples

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
The Evolution of the Web and Implications for an Incremental Crawler

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Focused Crawling Using Context Graphs

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Approximating Aggregate Queries about Web Pages via Random Walks

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Computing Geographical Scopes of Web Resources

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases

Focused crawling for both topical relevance and quality of medical information

Proceedings of the 14th ACM international conference on Information and knowledge management
Incremental mining of information interest for personalized web scanning

Information Systems
Quality and relevance of domain-specific search: A case study in mental health

Information Retrieval
Structure-driven crawler generation by example

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
MedicoPort: A medical search engine for all

Computer Methods and Programs in Biomedicine
Locality-Based pruning methods for web search

ACM Transactions on Information Systems (TOIS)
Profile-based focused crawling for social media-sharing websites

Journal on Image and Video Processing
Exploiting Tags and Social Profiles to Improve Focused Crawling

WI-IAT '09 Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Volume 01
Incremental mining of information interest for personalized web scanning

Information Systems
A semantic search system using query definitions

Proceedings of the First International Conference on Intelligent Interactive Technologies and Multimedia
A conceptual framework for efficient web crawling in virtual integration contexts

WISM'11 Proceedings of the 2011 international conference on Web information systems and mining - Volume Part II
A tool for link-based web page classification

CAEPIA'11 Proceedings of the 14th international conference on Advances in artificial intelligence: Spanish association for artificial intelligence
wHunter: a focused web crawler – a tool for digital library

ICADL'04 Proceedings of the 7th international Conference on Digital Libraries: international collaboration and cross-fertilization
Intelligent web navigation

FDIA'09 Proceedings of the Third BCS-IRSG conference on Future Directions in Information Access
Topical crawling on the web through local site-searches

Journal of Web Engineering
A synergistic approach to efficient web searching

Intelligent Decision Technologies


Abstract

In recent years, the World Wide Web has grown enormously in size, and vast repositories of information are available on practically every possible topic. In such cases, it is valuable to perform topical resource discovery effectively. Consequently, several new ideas have been proposed in recent years; among them, a key technique is focused crawling, which is able to crawl particular topical portions of the World Wide Web quickly without having to explore all web pages. In this paper, we propose the novel concept of intelligent crawling, which learns characteristics of the linkage structure of the World Wide Web while performing the crawl. Specifically, the intelligent crawler uses the content of the inlinking web pages, the candidate URL structure, or other behaviors of the inlinking web pages or siblings in order to estimate the probability that a candidate is useful for a given crawl. This is a much more general framework than the focused crawling technique, which is based on a predefined understanding of the topical structure of the web. The techniques discussed in this paper are applicable to crawling web pages that satisfy arbitrary user-defined predicates such as topical queries, keyword queries, or any combination of the above. Unlike in focused crawling, it is not necessary to provide representative topical examples, since the crawler can learn its way into the appropriate topic. We refer to this technique as intelligent crawling because of its adaptive nature in adjusting to the web page linkage structure. We discuss how to intelligently select the features that are most useful for a given crawl. The learning crawler is capable of reusing the knowledge gained in a given crawl in order to provide more efficient crawling for closely related predicates.
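
The idea in the abstract of estimating, during the crawl, how likely a candidate URL is to satisfy the predicate can be illustrated with a minimal sketch. This is not the authors' algorithm: the names `LearningCrawler`, `url_tokens`, and `TOY_WEB`, the Laplace-smoothed token statistics, the fixed weight of 2.0 for links from relevant pages, and the in-memory toy graph standing in for the web are all assumptions made for illustration.

```python
import heapq
from collections import defaultdict


def url_tokens(url):
    """Split a URL into lowercase alphanumeric tokens (hypothetical feature choice)."""
    token, tokens = "", []
    for ch in url.lower():
        if ch.isalnum():
            token += ch
        elif token:
            tokens.append(token)
            token = ""
    if token:
        tokens.append(token)
    return tokens


class LearningCrawler:
    """Best-first crawler whose priorities are learned while it crawls."""

    def __init__(self, graph, predicate):
        self.graph = graph            # toy stand-in for the web: url -> (text, outlinks)
        self.predicate = predicate    # arbitrary user-defined predicate over page text
        self.seen = defaultdict(int)  # token -> crawled pages whose URL contained it
        self.hits = defaultdict(int)  # token -> those pages that satisfied the predicate
        self.crawled = 0
        self.relevant = 0

    def score(self, url, parent_relevant):
        # Smoothed overall success rate of the crawl so far.
        prior = (self.relevant + 1) / (self.crawled + 2)
        ratios = []
        for t in url_tokens(url):
            if self.seen[t]:
                p = (self.hits[t] + 1) / (self.seen[t] + 2)
                ratios.append(p / prior)  # > 1 means the token looks promising
        url_score = sum(ratios) / len(ratios) if ratios else 1.0
        link_score = 2.0 if parent_relevant else 1.0  # assumed fixed weight
        return url_score * link_score

    def crawl(self, seeds, budget):
        frontier = [(-1.0, i, s) for i, s in enumerate(seeds)]
        heapq.heapify(frontier)
        counter = len(seeds)  # tie-breaker: earlier discoveries first
        visited, order = set(), []
        while frontier and len(order) < budget:
            _, _, url = heapq.heappop(frontier)
            if url in visited or url not in self.graph:
                continue
            visited.add(url)
            text, outlinks = self.graph[url]
            ok = bool(self.predicate(text))
            order.append((url, ok))
            # Update the learned statistics with this page's outcome.
            self.crawled += 1
            self.relevant += ok
            for t in set(url_tokens(url)):
                self.seen[t] += 1
                self.hits[t] += ok
            # Simplification: candidates are scored once, when discovered.
            for link in outlinks:
                if link not in visited:
                    heapq.heappush(frontier, (-self.score(link, ok), counter, link))
                    counter += 1
        return order


TOY_WEB = {
    "http://a.example/index": ("portal", ["http://a.example/sports/1", "http://a.example/news/1"]),
    "http://a.example/sports/1": ("football scores", ["http://a.example/sports/2", "http://a.example/news/2"]),
    "http://a.example/news/1": ("politics", []),
    "http://a.example/sports/2": ("football transfer", []),
    "http://a.example/news/2": ("weather", []),
}
```

On this toy graph, a crawler created with the predicate `lambda text: "football" in text` and started from `http://a.example/index` fetches the second sports page before either news page, because the `sports` token and the relevant inlinking page both raise its priority. No topical examples are supplied up front; the statistics come entirely from pages seen during the crawl.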