Topical web crawlers: Evaluating adaptive algorithms

Authors:
Filippo Menczer;Gautam Pant;Padmini Srinivasan
Affiliations:
Indiana University, Bloomington, IN;University of Utah, Salt Lake City, UT;University of Iowa, Iowa City, IA
Venue:
ACM Transactions on Internet Technology (TOIT)
Year:
2004

Abstract

Topical crawlers are increasingly seen as a way to address the scalability limitations of universal search engines, by distributing the crawling process across users, queries, or even client computers. The context available to such crawlers can guide the navigation of links with the goal of efficiently locating highly relevant target pages. We developed a framework to fairly evaluate topical crawling algorithms under a number of performance metrics. We use this framework here to evaluate different algorithms that have proven highly competitive among those proposed in the literature and in our own previous research. In particular, we focus on the tradeoff between exploration and exploitation of the cues available to a crawler, and on adaptive crawlers that use machine learning techniques to guide their search. We find that the best performance is achieved by a novel combination of explorative and exploitative bias, and introduce an evolutionary crawler that surpasses the performance of the best nonadaptive crawler after sufficiently long crawls. We also analyze the computational complexity of the various crawlers and discuss how performance and complexity scale with available resources. Evolutionary crawlers achieve high efficiency and scalability by distributing the work across concurrent agents, resulting in the best performance/cost ratio.
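
The exploration/exploitation tradeoff mentioned in the abstract can be illustrated with a toy best-first frontier. This is only a sketch under stated assumptions, not the algorithms evaluated in the paper: the abstract does not specify them, so the function name `best_n_first_crawl`, the in-memory link graph, and the relevance estimates are all hypothetical. A batch size of 1 is purely greedy (exploitative); a larger batch also follows lower-scored links, which adds exploration.

```python
import heapq
from typing import Callable, Dict, List, Set, Tuple


def best_n_first_crawl(
    graph: Dict[str, List[str]],
    score: Callable[[str], float],
    seeds: List[str],
    batch_size: int,
    max_pages: int,
) -> List[str]:
    """Visit up to max_pages pages, taking the batch_size best frontier links per step.

    `graph` stands in for the Web (page -> outgoing links) and `score` is a
    hypothetical estimate of how relevant a link's target is.
    """
    # heapq is a min-heap, so scores are negated to pop the best link first.
    frontier: List[Tuple[float, str]] = [(-score(u), u) for u in seeds]
    heapq.heapify(frontier)
    seen: Set[str] = set(seeds)
    visited: List[str] = []

    while frontier and len(visited) < max_pages:
        # Pick the top batch_size links before any of them is expanded.
        n = min(batch_size, len(frontier))
        batch = [heapq.heappop(frontier) for _ in range(n)]
        for _, url in batch:
            if len(visited) >= max_pages:
                break
            visited.append(url)
            for link in graph.get(url, []):
                if link not in seen:
                    seen.add(link)
                    heapq.heappush(frontier, (-score(link), link))
    return visited


# Toy data (hypothetical): page "d" is relevant but hidden behind low-scored "b".
toy_graph = {"s": ["a", "b"], "a": ["c"], "b": ["d"], "c": [], "d": []}
toy_scores = {"s": 1.0, "a": 0.9, "b": 0.2, "c": 0.5, "d": 0.8}

greedy = best_n_first_crawl(toy_graph, toy_scores.get, ["s"], batch_size=1, max_pages=4)
explorative = best_n_first_crawl(toy_graph, toy_scores.get, ["s"], batch_size=2, max_pages=4)
print(greedy)       # ['s', 'a', 'c', 'b']
print(explorative)  # ['s', 'a', 'b', 'd']
```

In this toy graph, with a budget of four pages, the greedy crawl spends its last fetches on "c" and "b" and never reaches "d", while the batch of two reaches "d" because it expands "b" earlier. The example only shows the mechanism; it says nothing about which setting performs better on the real Web.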