One of the most pressing problems in web crawling, the most expensive task of any search engine in terms of time and bandwidth consumption, is the detection of useless segments of the Web. In some cases such segments are purposely created to deceive the crawling engine; in others, they simply contain no useful information. The typical approach today relies on a human-compiled blacklist of sites to avoid (e.g., advertising sites and web counters), but, given the highly dynamic nature of the Web, keeping such lists up to date manually is infeasible. In this work we present a solution based on web usage statistics, aimed at automatically, and therefore dynamically, building blacklists of sites that the users of a monitored web community consider (or appear to consider) useless or uninteresting. Our method performs a linear-time analysis of the traffic information, yielding an abstraction of the linked Web that can be updated incrementally and therefore computed in a streaming fashion. The crawler can use the resulting list to prune such sites, or to assign them a low priority, before the (re-)spidering activity starts, and thus without analysing the content of the crawled documents.
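The abstract does not spell out the traffic analysis itself, but the key properties it claims (constant work per log record, an incrementally updated site-level abstraction of the linked Web, and a blacklist derived from observed user behaviour) can be illustrated with a minimal sketch. The heuristic used below, flagging sites that generate many fetches but are almost never the target of a deliberate user visit, as well as every identifier and threshold, is an assumption made for illustration only and is not the paper's actual criterion.

```python
from collections import defaultdict
from urllib.parse import urlparse


class SiteTrafficModel:
    """Site-level abstraction of the linked Web, maintained incrementally
    from a stream of access-log records of a monitored web community.
    (Illustrative sketch; the scoring heuristic is an assumption, not the
    method described in the paper.)"""

    def __init__(self):
        self.fetches = defaultdict(int)            # site -> total requests observed
        self.navigations = defaultdict(int)        # site -> requests that look user-initiated
        self.inter_site_links = defaultdict(int)   # (referrer site, target site) -> count

    @staticmethod
    def _site(url):
        return urlparse(url).netloc.lower()

    def update(self, url, referrer=None, is_page_view=False):
        """Constant work per record, so a whole log is processed in linear
        time and the model can be updated in a streaming fashion."""
        site = self._site(url)
        self.fetches[site] += 1
        if is_page_view:
            self.navigations[site] += 1
        if referrer:
            ref_site = self._site(referrer)
            if ref_site and ref_site != site:
                self.inter_site_links[(ref_site, site)] += 1

    def blacklist(self, min_fetches=50, max_navigation_ratio=0.02):
        """Candidate blacklist: sites that produce traffic (ads, counters,
        embedded widgets) but are almost never deliberately visited."""
        return [
            site
            for site, n in self.fetches.items()
            if n >= min_fetches and self.navigations[site] / n <= max_navigation_ratio
        ]


# Hypothetical usage: feed log records as they arrive, then hand the list
# to the crawler's frontier so the listed sites are pruned or deprioritised
# before (re-)spidering starts.
model = SiteTrafficModel()
model.update("http://ads.example.com/banner.gif",
             referrer="http://news.example.org/article", is_page_view=False)
model.update("http://news.example.org/article", is_page_view=True)
to_avoid = model.blacklist(min_fetches=1)
```

Because each record touches only a few hash-table entries, the blacklist can be regenerated at any point from the current counters, which is what allows the crawler to consult an up-to-date list without ever inspecting document content.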