A General Evaluation Framework for Topical Crawlers

  • Authors:
  • P. Srinivasan; F. Menczer; G. Pant

  • Affiliations:
  • School of Library & Information Science and Department of Management Sciences, The University of Iowa, Iowa City, IA 52242, USA; School of Informatics and Department of Computer Science, Indiana University, Bloomington, IN 47408, USA; School of Accounting and Information Systems, University of Utah, Salt Lake City, UT 84112, USA

  • Venue:
  • Information Retrieval
  • Year:
  • 2005

Abstract

Topical crawlers are becoming important tools to support applications such as specialized Web portals, online searching, and competitive intelligence. As the Web mining field matures, the disparate crawling strategies proposed in the literature will have to be evaluated and compared on common tasks through well-defined performance measures. This paper presents a general framework to evaluate topical crawlers. We identify a class of tasks that model crawling applications of varying nature and difficulty. We then introduce a set of performance measures for fair comparative evaluations of crawlers along several dimensions, including generalized notions of precision, recall, and efficiency that are appropriate and practical for the Web. The framework relies on independent relevance judgements compiled by human editors and available from public directories. Two sources of evidence are proposed to assess crawled pages, capturing different relevance criteria. Finally, we introduce a set of topic characterizations to analyze the variability in crawling effectiveness across topics. The proposed evaluation framework synthesizes a number of methodologies from the topical crawlers literature and many lessons learned from several studies conducted by our group. The general framework is described in detail and then illustrated in practice by a case study that evaluates four public crawling algorithms. We found the proposed framework effective at evaluating, comparing, differentiating, and interpreting the performance of the four crawlers. For example, we found the IS crawler to be most sensitive to the popularity of topics.
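As a rough illustration of the generalized precision and recall measures mentioned in the abstract, the sketch below scores a crawl against a target set of relevant pages drawn from an independent directory. This is a minimal sketch under assumed definitions (relevance as set membership in the directory's topic listing); the function names and toy data are illustrative, not the paper's actual measures, which also weight pages by relevance criteria.

```python
# Hypothetical sketch: set-based generalized precision (harvest rate) and
# generalized recall for a topical crawl, assuming relevance judgements come
# from an independent human-edited directory. All names are illustrative.

def generalized_precision(crawled: list[str], relevant: set[str]) -> float:
    """Fraction of crawled pages judged relevant (the crawler's harvest rate)."""
    if not crawled:
        return 0.0
    hits = sum(1 for url in crawled if url in relevant)
    return hits / len(crawled)

def generalized_recall(crawled: list[str], relevant: set[str]) -> float:
    """Fraction of the known relevant (target) set that the crawl retrieved."""
    if not relevant:
        return 0.0
    hits = len(set(crawled) & relevant)
    return hits / len(relevant)

# Toy example: a 4-page crawl against a 4-page target set.
target = {"a.html", "b.html", "c.html", "d.html"}
crawl = ["a.html", "x.html", "b.html", "y.html"]
print(generalized_precision(crawl, target))  # 0.5
print(generalized_recall(crawl, target))     # 0.5
```

Tracking both measures as functions of the number of pages crawled, rather than as single numbers, is what allows the efficiency of different crawling strategies to be compared on a common task.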