Discovery of ads web hosts through traffic data analysis

  • Authors:
  • V. Bacarella; F. Giannotti; M. Nanni; D. Pedreschi

  • Affiliations:
  • University of Pisa, Pisa, Italy; ISTI-CNR, Pisa, Italy; ISTI-CNR, Pisa, Italy; University of Pisa, Pisa, Italy

  • Venue:
  • Proceedings of the 9th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery
  • Year:
  • 2004

Abstract

One of the most pressing problems in web crawling -- the most expensive task of any search engine in terms of time and bandwidth consumption -- is the detection of useless segments of the Internet. In some cases such segments are purposely created to deceive the crawling engine, while in others they simply do not contain any useful information. Currently, the typical approach to the problem is to use a human-compiled blacklist of sites to avoid (e.g., advertising sites and web counters), but, given the highly dynamic nature of the Internet, keeping such lists up to date manually is infeasible. In this work we present a solution based on web usage statistics, aimed at automatically -- and therefore dynamically -- building blacklists of sites that the users of a monitored web community consider (or appear to consider) useless or uninteresting. Our method performs a linear-time analysis of the traffic information that yields an abstraction of the linked web; this abstraction can be updated incrementally, thus allowing a streaming computation. The crawler can use the resulting list to prune such sites, or to assign them a low priority, before the (re-)spidering activity starts, and therefore without analysing the content of crawled documents.
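
To make the streaming, linear-time flavour of such an approach concrete, the sketch below shows one hypothetical way to fold a traffic-log stream into per-host statistics and extract blacklist candidates. The record schema (host, user, dwell_seconds) and the "many requests, near-zero dwell" score are illustrative assumptions, not the measure defined in the paper; they merely mimic the kind of usage signal that separates ad servers and web counters from content sites.

```python
from collections import defaultdict

class HostStats:
    """Per-host aggregate kept in memory; one small record per host."""
    __slots__ = ("requests", "users", "total_dwell")

    def __init__(self):
        self.requests = 0
        self.users = set()
        self.total_dwell = 0.0

def update(stats, record):
    """Fold one traffic-log record into the per-host statistics.

    record: dict with keys 'host', 'user', 'dwell_seconds' (assumed schema).
    Each record touches a single host entry, so processing a traffic
    stream is linear in the number of records and can run incrementally.
    """
    s = stats[record["host"]]
    s.requests += 1
    s.users.add(record["user"])
    s.total_dwell += record["dwell_seconds"]

def blacklist(stats, min_requests=100, max_avg_dwell=1.0):
    """Flag hosts that receive many requests but that users never dwell on,
    e.g. ad servers or counters fetched automatically when a page loads.
    Thresholds are illustrative and would need tuning in practice."""
    return [
        host
        for host, s in stats.items()
        if s.requests >= min_requests
        and s.total_dwell / s.requests <= max_avg_dwell
    ]

if __name__ == "__main__":
    stats = defaultdict(HostStats)
    log_stream = [
        {"host": "ads.example.net", "user": "u1", "dwell_seconds": 0.0},
        {"host": "ads.example.net", "user": "u2", "dwell_seconds": 0.1},
        {"host": "news.example.org", "user": "u1", "dwell_seconds": 45.0},
    ]
    for rec in log_stream:
        update(stats, rec)
    print(blacklist(stats, min_requests=2))  # -> ['ads.example.net']
```

A crawler could consume the resulting host list before (re-)spidering starts, either dropping the listed hosts from its frontier or demoting their priority, without ever inspecting the content of the pages they serve.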