A co-operative web services paradigm for supporting crawlers

  • Authors:
  • Aravind Chandramouli; Susan Gauch

  • Affiliations:
  • University of Kansas, Lawrence, KS; University of Kansas, Lawrence, KS

  • Venue:
  • Large Scale Semantic Access to Content (Text, Image, Video, and Sound)
  • Year:
  • 2007

Abstract

The traditional crawlers used by search engines to build their collections of Web pages frequently gather unmodified pages that already exist in those collections. This creates unnecessary Internet traffic and wastes search engine resources during page collection and indexing. The crawlers are also generally unable to collect dynamic pages, causing them to miss valuable information, and they cannot easily detect deleted pages, resulting in outdated search engine collections. To address these issues, we propose a new Web services paradigm for Website/crawler interaction that is co-operative and exploits the information present in the Web logs and file system. Our system supports a querying mechanism wherein the crawler can issue queries to the Web service on the Website and then collect pages based on the information provided in response to the query. We present experimental results demonstrating that, when compared to traditional crawlers, this approach provides bandwidth savings, more complete Web page collections, and timely notification of deleted pages. We experimentally compare the relative merits of using only Web logs, only file system information, and combinations of the two sources to provide information for the Web service.
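The querying mechanism described in the abstract might look like the following minimal sketch. All class, method, and field names here are hypothetical illustrations, not the authors' actual interface; the paper's Web service (which would draw on Web logs and file-system metadata over HTTP) is modeled in-process for simplicity.

```python
from dataclasses import dataclass

# Hypothetical model of the co-operative Web service: the site tracks
# per-page modification times (e.g. from the file system) and deletions
# (e.g. inferred from Web logs), and answers crawler queries of the form
# "what has changed since timestamp T?".

@dataclass
class PageInfo:
    url: str
    last_modified: float  # Unix timestamp, e.g. from the file system
    deleted: bool = False

class SiteWebService:
    def __init__(self, pages):
        self.pages = {p.url: p for p in pages}

    def query_changes(self, since):
        """Return URLs modified after `since`, plus URLs deleted from the site."""
        modified = sorted(p.url for p in self.pages.values()
                          if not p.deleted and p.last_modified > since)
        deleted = sorted(p.url for p in self.pages.values() if p.deleted)
        return {"modified": modified, "deleted": deleted}

# A crawler would fetch only the pages listed under "modified" and drop
# the "deleted" ones from its index, saving bandwidth and staying current.
service = SiteWebService([
    PageInfo("/index.html", last_modified=1000.0),
    PageInfo("/news.html", last_modified=2000.0),
    PageInfo("/old.html", last_modified=500.0, deleted=True),
])
result = service.query_changes(since=1500.0)
print(result)  # {'modified': ['/news.html'], 'deleted': ['/old.html']}
```

In this sketch the crawler's last-visit timestamp drives the query, so unmodified pages are never re-fetched, which is the bandwidth saving the abstract claims.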