Mercator: A scalable, extensible Web crawler

Authors:
Allan Heydon; Marc Najork
Affiliations:
Compaq Systems Research Center, 130 Lytton Avenue, Palo Alto, CA 94301, USA. E-mail: {heydon,najork}@pa.dec.com
Venue:
World Wide Web
Year:
1999

Abstract

This paper describes Mercator, a scalable, extensible Web crawler written entirely in Java. Scalable Web crawlers are an important component of many Web services, but their design is not well documented in the literature. We enumerate the major components of any scalable Web crawler, comment on alternatives and tradeoffs in their design, and describe the particular components used in Mercator. We also describe Mercator’s support for extensibility and customizability. Finally, we comment on Mercator’s performance, which we have found to be comparable to that of other crawlers for which performance numbers have been published.