This paper describes Mercator, a scalable, extensible Web crawler written entirely in Java. Scalable Web crawlers are an important component of many Web services, but their design is not well documented in the literature. We enumerate the major components of any scalable Web crawler, comment on alternatives and tradeoffs in their design, and describe the particular components used in Mercator. We also describe Mercator’s support for extensibility and customizability. Finally, we comment on Mercator’s performance, which we have found to be comparable to that of other crawlers for which performance numbers have been published.