A co-operative web services paradigm for supporting crawlers

  • Authors:
  • Susan Gauch; Aravind Chandramouli

  • Affiliations:
  • University of Kansas; University of Kansas

  • Venue:
  • A co-operative web services paradigm for supporting crawlers
  • Year:
  • 2007

Abstract

Search engines have become the major tool by which users locate information on the Web. Search engines create their collections through the use of crawlers, programs that download pages on the Web by following hyperlinks. However, this approach to crawling has major problems. First, because most pages on the Web are not modified often, crawlers create needless traffic on the network and on Websites by repeatedly downloading unchanged content. Second, since the Web is large, search engines can only refresh their collection periodically, so the collection becomes outdated and contains a number of broken links. In addition, crawlers are unable to collect pages that are created dynamically in response to database queries. Furthermore, crawlers cannot download all pages on the Web, both because of limited resources and to avoid overloading Websites; typically, they attempt to download the “important” pages using a URL ordering algorithm. Current approaches to URL ordering are based on link structure, and they are expensive and/or miss many good pages. To address these issues, we present a collaborative approach in which Websites coordinate with crawlers to provide increased capabilities. Our system supports a querying mechanism wherein the crawler can issue queries to a Web service on the Website; to answer these queries, we exploit valuable information present in the Web logs and the file system on the Web server. We also investigate a novel URL ordering algorithm that exploits the access-count information present in the Web logs on the individual Websites. In particular, we develop URL ordering algorithms based on internal and external counts and compare them empirically with a breadth-first search crawl. To demonstrate the effectiveness of our collaborative approach, we performed experiments over an eight-week period on the ITTC Website, to whose Web logs and file system we had access. Our experiments show that our approach can increase the number of pages collected by nearly 6 times compared to traditional crawling, and it provides bandwidth savings of nearly 100% for the same set of pages. We also experimentally compare the relative merits of using only Web logs, only file system information, and combinations of the two sources to provide information for the Web service. Further, to demonstrate that using the popularity information from Web logs lets us retrieve high-quality pages earlier in the crawl, we performed experiments on two data sets using the Web logs from the ITTC and CiteSeer Websites. On these data sets, we achieve statistically significant improvements of 57.2% and 65.7% in the ordering of high-quality pages (as indicated by Google’s PageRank) over that of a breadth-first search crawl.
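
To make the two ideas in the abstract concrete, the sketch below shows (a) a Website-side query that answers "what changed since the crawler's last visit?" from file-system timestamps, and (b) crawler-side URL ordering by access counts taken from the Web logs instead of breadth-first order. This is a minimal illustration under assumptions, not the paper's actual Web service interface or its internal/external count algorithms; the names (`PageInfo`, `changed_since`, `order_by_popularity`) and the data are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PageInfo:
    url: str              # path on the Website
    last_modified: float  # from the file system (Unix timestamp)
    access_count: int     # from the Web server's access log

def changed_since(pages: List[PageInfo], last_crawl: float) -> List[PageInfo]:
    """Website-side answer: only pages modified after the crawler's last
    visit, so unchanged content is never re-downloaded."""
    return [p for p in pages if p.last_modified > last_crawl]

def order_by_popularity(pages: List[PageInfo]) -> List[str]:
    """Crawler-side ordering: most-accessed pages first, so 'important'
    pages are fetched early in the crawl."""
    return [p.url for p in sorted(pages, key=lambda p: p.access_count, reverse=True)]

# Hypothetical data the service would derive from its logs and file system.
site = [
    PageInfo("/index.html", last_modified=1_170_000_000, access_count=420),
    PageInfo("/papers/crawler.html", last_modified=1_172_000_000, access_count=97),
    PageInfo("/old/archive.html", last_modified=1_000_000_000, access_count=3),
]
to_fetch = order_by_popularity(changed_since(site, last_crawl=1_100_000_000))
print(to_fetch)  # ['/index.html', '/papers/crawler.html']
```

In this toy run the unchanged archive page is skipped entirely (the source of the bandwidth savings the abstract reports), and the remaining URLs are fetched in order of their log-derived popularity rather than discovery order.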