Archiving the web using page changes patterns: a case study

Authors:
Myriam Ben Saad; Stéphane Gançarski
Affiliations:
LIP6, University P. and M. Curie, Paris, France; LIP6, University P. and M. Curie, Paris, France
Venue:
Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries
Year:
2011

Abstract

A pattern is a model or template that summarizes and describes the behavior (or trend) of data that generally exhibits recurrent events. Patterns have received considerable attention in recent years and have been widely studied in the data mining field. Various pattern mining approaches have been proposed and applied in areas such as network monitoring, moving object tracking, financial and medical data analysis, and scientific data processing. In these contexts, discovered patterns have been used to detect anomalies, to predict data behavior (or trends) and, more generally, to simplify data processing or improve system performance. However, to the best of our knowledge, patterns have never been used in the context of web archiving. Web archiving is the process of continuously collecting and preserving portions of the World Wide Web for future generations. In this paper, we show how patterns of page changes can be useful tools for archiving web sites efficiently. We first define our pattern model, which describes the changes of pages. Then, we present the strategy used to (i) extract the temporal evolution of page changes, (ii) discover patterns and (iii) exploit them to improve web archives. We choose the archive of the French public TV channels « France Télévisions » as a case study to validate our approach. Our experimental evaluation, based on real web pages, shows the utility of patterns for improving archive quality and optimizing indexing or storage.
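
The three-step strategy named in the abstract (extract changes, discover patterns, exploit them) can be illustrated with a minimal Python sketch. The abstract does not specify the pattern model, so everything here is an assumption made for illustration: a pattern is taken to be the average number of changes per hour of the day, a change is any difference in raw content between consecutive snapshots, and the function names `extract_changes`, `discover_pattern` and `schedule_crawls` are hypothetical. This is not the authors' method.

```python
from datetime import datetime


def extract_changes(snapshots):
    """Step (i): return the timestamps at which the page content differs
    from the previous snapshot. Assumption: a change is any difference
    in the raw content string."""
    changes = []
    previous = None
    for timestamp, content in sorted(snapshots, key=lambda s: s[0]):
        if previous is not None and content != previous:
            changes.append(timestamp)
        previous = content
    return changes


def discover_pattern(change_times, days_observed):
    """Step (ii): summarize changes as an average number of changes per
    hour of the day (a hypothetical, deliberately simple pattern model)."""
    counts = [0] * 24
    for timestamp in change_times:
        counts[timestamp.hour] += 1
    return [count / days_observed for count in counts]


def schedule_crawls(pattern, budget):
    """Step (iii): spend a limited crawl budget on the hours with the
    highest expected change rate; ties go to the earlier hour."""
    ranked = sorted(range(24), key=lambda hour: (-pattern[hour], hour))
    return sorted(ranked[:budget])


# Made-up snapshots of one page over two days (not data from the paper).
snapshots = [
    (datetime(2011, 1, 1, 8), "a"),
    (datetime(2011, 1, 1, 9), "b"),
    (datetime(2011, 1, 1, 10), "b"),
    (datetime(2011, 1, 1, 20), "c"),
    (datetime(2011, 1, 2, 9), "d"),
    (datetime(2011, 1, 2, 20), "d"),
]

changes = extract_changes(snapshots)
pattern = discover_pattern(changes, days_observed=2)
crawl_hours = schedule_crawls(pattern, budget=2)
print(crawl_hours)
```

With a budget of two crawls per day, the sketch selects the two hours in which the made-up page changed most often. A real archive crawler would also need a notion of change importance and a finer time granularity, neither of which this abstract describes.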