One of the most pressing problems in web crawling, the most expensive task of any search engine in terms of time and bandwidth consumption, is the detection of useless segments of the Web. In some cases such segments are purposely created to deceive the crawling engine; in others, they simply contain no useful information. The typical approach today relies on a human-compiled blacklist of sites to avoid (e.g., advertising sites and web counters), but, given the highly dynamic nature of the Web, keeping such lists up to date manually is infeasible. In this work we present a solution based on web usage statistics, aimed at automatically, and therefore dynamically, building blacklists of sites that the users of a monitored web community consider (or appear to consider) useless or uninteresting. Our method performs a linear-time analysis of the traffic information, yielding an abstraction of the linked Web that can be updated incrementally and therefore computed in a streaming fashion. The crawler can use the resulting list to prune such sites, or to assign them a low priority, before the (re-)spidering activity starts, and thus without analysing the content of the crawled documents.
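The abstract does not spell out the traffic analysis itself, but the key properties it claims (constant work per log record, an incrementally updated site-level abstraction of the linked Web, and a blacklist derived from observed user behaviour) can be illustrated with a minimal sketch. The heuristic used below, flagging sites that generate many fetches but are almost never the target of a deliberate user visit, as well as every identifier and threshold, is an assumption made for illustration only and is not the paper's actual criterion.

```python
from collections import defaultdict
from urllib.parse import urlparse


class SiteTrafficModel:
    """Site-level abstraction of the linked Web, maintained incrementally
    from a stream of access-log records of a monitored web community.
    (Illustrative sketch; the scoring heuristic is an assumption, not the
    method described in the paper.)"""

    def __init__(self):
        self.fetches = defaultdict(int)            # site -> total requests observed
        self.navigations = defaultdict(int)        # site -> requests that look user-initiated
        self.inter_site_links = defaultdict(int)   # (referrer site, target site) -> count

    @staticmethod
    def _site(url):
        return urlparse(url).netloc.lower()

    def update(self, url, referrer=None, is_page_view=False):
        """Constant work per record, so a whole log is processed in linear
        time and the model can be updated in a streaming fashion."""
        site = self._site(url)
        self.fetches[site] += 1
        if is_page_view:
            self.navigations[site] += 1
        if referrer:
            ref_site = self._site(referrer)
            if ref_site and ref_site != site:
                self.inter_site_links[(ref_site, site)] += 1

    def blacklist(self, min_fetches=50, max_navigation_ratio=0.02):
        """Candidate blacklist: sites that produce traffic (ads, counters,
        embedded widgets) but are almost never deliberately visited."""
        return [
            site
            for site, n in self.fetches.items()
            if n >= min_fetches and self.navigations[site] / n <= max_navigation_ratio
        ]


# Hypothetical usage: feed log records as they arrive, then hand the list
# to the crawler's frontier so the listed sites are pruned or deprioritised
# before (re-)spidering starts.
model = SiteTrafficModel()
model.update("http://ads.example.com/banner.gif",
             referrer="http://news.example.org/article", is_page_view=False)
model.update("http://news.example.org/article", is_page_view=True)
to_avoid = model.blacklist(min_fetches=1)
```

Because each record touches only a few hash-table entries, the blacklist can be regenerated at any point from the current counters, which is what allows the crawler to consult an up-to-date list without ever inspecting document content.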