A co-operative web services paradigm for supporting crawlers

  • Authors:
  • Aravind Chandramouli; Susan Gauch

  • Affiliations:
  • University of Kansas, Lawrence, KS; University of Kansas, Lawrence, KS

  • Venue:
  • Large Scale Semantic Access to Content (Text, Image, Video, and Sound)
  • Year:
  • 2007

Abstract

The traditional crawlers used by search engines to build their collections of Web pages frequently gather unmodified pages that already exist in those collections. This creates unnecessary Internet traffic and wastes search engine resources during page collection and indexing. The crawlers are also generally unable to collect dynamic pages, causing them to miss valuable information, and they cannot easily detect deleted pages, resulting in outdated search engine collections. To address these issues, we propose a new Web services paradigm for Website/crawler interaction that is co-operative and exploits the information present in the Web logs and file system. Our system supports a querying mechanism wherein the crawler can issue queries to the Web service on the Website and then collect pages based on the information provided in response to the query. We present experimental results demonstrating that, when compared to traditional crawlers, this approach provides bandwidth savings, more complete Web page collections, and timely notification of deleted pages. We experimentally compare the relative merits of using only Web logs, only file system information, and combinations of the two sources to provide information for the Web service.
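The querying mechanism described in the abstract might look like the following minimal sketch. All class, method, and field names here are hypothetical illustrations, not the authors' actual interface; the paper's Web service (which would draw on Web logs and file-system metadata over HTTP) is modeled in-process for simplicity.

```python
from dataclasses import dataclass

# Hypothetical model of the co-operative Web service: the site tracks
# per-page modification times (e.g. from the file system) and deletions
# (e.g. inferred from Web logs), and answers crawler queries of the form
# "what has changed since timestamp T?".

@dataclass
class PageInfo:
    url: str
    last_modified: float  # Unix timestamp, e.g. from the file system
    deleted: bool = False

class SiteWebService:
    def __init__(self, pages):
        self.pages = {p.url: p for p in pages}

    def query_changes(self, since):
        """Return URLs modified after `since`, plus URLs deleted from the site."""
        modified = sorted(p.url for p in self.pages.values()
                          if not p.deleted and p.last_modified > since)
        deleted = sorted(p.url for p in self.pages.values() if p.deleted)
        return {"modified": modified, "deleted": deleted}

# A crawler would fetch only the pages listed under "modified" and drop
# the "deleted" ones from its index, saving bandwidth and staying current.
service = SiteWebService([
    PageInfo("/index.html", last_modified=1000.0),
    PageInfo("/news.html", last_modified=2000.0),
    PageInfo("/old.html", last_modified=500.0, deleted=True),
])
result = service.query_changes(since=1500.0)
print(result)  # {'modified': ['/news.html'], 'deleted': ['/old.html']}
```

In this sketch the crawler's last-visit timestamp drives the query, so unmodified pages are never re-fetched, which is the bandwidth saving the abstract claims.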