Evaluating topic-driven web crawlers

Authors:
Filippo Menczer;Gautam Pant;Padmini Srinivasan;Miguel E. Ruiz
Affiliations:
Univ. of Iowa, Iowa City;Univ. of Iowa, Iowa City;Univ. of Iowa, Iowa City;Textwise, Syracuse, NY
Venue:
Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2001

Abstract

Due to limited bandwidth, storage, and computational resources, and to the dynamic nature of the Web, search engines cannot index every Web page, and even the covered portion of the Web cannot be monitored continuously for changes. Therefore it is essential to develop effective crawling strategies to prioritize the pages to be indexed. The issue is even more important for topic-specific search engines, where crawlers must make additional decisions based on the relevance of visited pages. However, it is difficult to evaluate alternative crawling strategies because relevant sets are unknown and the search space is changing. We propose three different methods to evaluate crawling strategies. We apply the proposed metrics to compare three topic-driven crawling algorithms based on similarity ranking, link analysis, and adaptive agents.
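
To make the idea of prioritizing pages by topical relevance concrete, here is a minimal sketch of a best-first crawler driven by similarity ranking. It is an illustration only, not the algorithm evaluated in the paper: the toy in-memory graph `TOY_WEB`, the function names, the bag-of-words cosine similarity, and the choice to score each outlink by the similarity of the page that links to it are all assumptions made for this example.

```python
import heapq
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_first_crawl(web, seeds, topic, max_pages):
    """Visit up to max_pages pages, always expanding the most promising link.

    `web` maps url -> (text, outlinks) and stands in for real HTTP fetching
    (an assumption of this sketch). Each outlink inherits the similarity
    score of the page it was found on.
    """
    topic_vec = Counter(topic.lower().split())
    # heapq is a min-heap, so scores are negated to pop the best link first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        text, outlinks = web[url]
        score = cosine(topic_vec, Counter(text.lower().split()))
        visited.append((url, score))
        for link in outlinks:
            if link in web and link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return visited


# Hypothetical five-page "web" used only to exercise the sketch.
TOY_WEB = {
    "a": ("web crawler search", ["b", "c"]),
    "b": ("cooking recipes pasta", ["d"]),
    "c": ("topic driven web crawler evaluation", ["e"]),
    "d": ("more pasta recipes", []),
    "e": ("crawler evaluation metrics", []),
}

if __name__ == "__main__":
    for url, score in best_first_crawl(TOY_WEB, ["a"], "web crawler evaluation", 4):
        print(url, round(score, 3))
```

With a budget of four pages, the link found on the off-topic page `b` stays at the bottom of the frontier and page `d` is never fetched, which is the kind of resource saving a topic-driven crawler aims for.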