Incremental web-site boundary detection using random walks

Authors:
Ayesh Alshukri;Frans Coenen;Michele Zito
Affiliations:
Department of Computer Science, University of Liverpool, Liverpool, UK;Department of Computer Science, University of Liverpool, Liverpool, UK;Department of Computer Science, University of Liverpool, Liverpool, UK
Venue:
MLDM'11 Proceedings of the 7th international conference on Machine learning and data mining in pattern recognition
Year:
2011

Abstract

The paper describes variations of the classical k-means clustering algorithm that can be used effectively to address the so-called Web-site Boundary Detection (WBD) problem. The suggested advantages of these techniques are that they can quickly identify most of the pages belonging to a web-site and, in the long run, return a solution whose accuracy is comparable to (if not better than) that of other clustering methods. We analyze our techniques on artificial clones of the web generated using a well-known preferential attachment method.
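
To make the evaluation setting concrete, the following is a minimal sketch of how an artificial web graph might be grown by preferential attachment. The abstract does not say which preferential attachment model the authors use, so this assumes a simple Barabasi-Albert style rule in which each new page links to existing pages with probability proportional to their in-degree plus one. The function name and the parameters n_pages, links_per_page and seed are hypothetical and chosen only for illustration.

```python
import random


def preferential_attachment_graph(n_pages, links_per_page, seed=0):
    """Grow a directed graph one page at a time (illustrative assumption,
    not the generator used in the paper).

    Each new page links to distinct existing pages chosen with probability
    proportional to (in-degree + 1). Returns a list of (source, target) edges.
    """
    rng = random.Random(seed)
    edges = []
    # Each page appears once for existing, plus once per link it has received,
    # so a uniform draw from this list is a draw proportional to in-degree + 1.
    targets = [0]
    for page in range(1, n_pages):
        # A new page cannot link to more distinct pages than already exist.
        k = min(links_per_page, page)
        chosen = set()
        while len(chosen) < k:
            chosen.add(rng.choice(targets))
        for dest in sorted(chosen):
            edges.append((page, dest))
            targets.append(dest)
        targets.append(page)
    return edges


if __name__ == "__main__":
    graph = preferential_attachment_graph(n_pages=50, links_per_page=2)
    print(len(graph), "links generated")
```

Pages created early tend to accumulate many in-links under this rule, which gives the skewed degree distribution that makes such graphs a plausible stand-in for real web data when testing a clustering method.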