NEAR-Miner: mining evolution associations of web site directories for efficient maintenance of web archives

Authors:
Ling Chen; Sourav S. Bhowmick; Wolfgang Nejdl
Affiliations:
L3S/University of Hannover, Hannover, Germany; Nanyang Technological University, Singapore; L3S/University of Hannover, Hannover, Germany
Venue:
Proceedings of the VLDB Endowment
Year:
2009

Abstract

Web archives preserve the history of autonomous Web sites and are potential gold mines for all kinds of media and business analysts. The most common Web archiving technique uses crawlers to automate the process of collecting Web pages. However, periodically (re)downloading the entire collection of pages from a large Web site is infeasible. In this paper, we take a step towards addressing this problem. We devise a data mining-driven policy for selectively (re)downloading Web pages located in hierarchical directory structures that are believed to have changed significantly (e.g., a substantial percentage of pages have been inserted into or removed from the directory). Consequently, there is no need to download and maintain pages that have not changed since the last crawl, as they can easily be retrieved from the archive. In our approach, we propose an off-line data mining algorithm called NEAR-Miner that analyzes the evolution history of the original Web site's directory structures, as stored in the archive, and mines negatively correlated association rules (NEARs) between ancestor-descendant Web directories. These rules indicate the evolution correlations between Web directories. Using the discovered rules, we propose an efficient Web archive maintenance algorithm called WARM that, for a given directory, optimally skips (during the next crawl) the subdirectories that are negatively correlated with that directory in undergoing significant changes. Our experimental results with real data show that our approach significantly improves the efficiency of the archive maintenance process at a slight cost to the "freshness" of the archives. Furthermore, our experiments demonstrate that it is not necessary to discover NEARs frequently, as the mined rules can be used effectively for archive maintenance over multiple versions.
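The idea of mining negative correlations between ancestor and descendant directories, then using them to skip subdirectories, can be illustrated with a minimal Python sketch. This is not the NEAR-Miner or WARM algorithm from the paper: the phi coefficient as the correlation measure, the threshold of -0.5, the function names (`phi`, `is_ancestor`, `mine_negative_rules`, `dirs_to_skip`), and the toy change history are all assumptions made for illustration. Each directory maps to a 0/1 sequence recording whether it changed significantly in each archived version.

```python
from math import sqrt


def phi(xs, ys):
    """Phi coefficient of two equal-length 0/1 sequences (0.0 if undefined)."""
    n = len(xs)
    n11 = sum(1 for x, y in zip(xs, ys) if x and y)  # both changed
    nx = sum(1 for x in xs if x)                     # first changed
    ny = sum(1 for y in ys if y)                     # second changed
    denom = sqrt(nx * (n - nx) * ny * (n - ny))
    if denom == 0:
        return 0.0
    return (n * n11 - nx * ny) / denom


def is_ancestor(a, d):
    """True if directory path a is a proper ancestor of directory path d."""
    return a != d and d.startswith(a.rstrip("/") + "/")


def mine_negative_rules(history, threshold=-0.5):
    """Return (ancestor, descendant, score) for pairs whose change
    indicators have a phi coefficient at or below the (assumed) threshold."""
    rules = []
    dirs = sorted(history)
    for a in dirs:
        for d in dirs:
            if is_ancestor(a, d):
                score = phi(history[a], history[d])
                if score <= threshold:
                    rules.append((a, d, score))
    return rules


def dirs_to_skip(rules, changed_dir):
    """Descendants that a crawler could skip once changed_dir is known
    to have changed significantly."""
    return {d for a, d, _ in rules if a == changed_dir}


# Hypothetical change history over six archived versions.
history = {
    "/news": [1, 0, 1, 0, 1, 0],
    "/news/archive": [0, 1, 0, 1, 0, 1],
    "/news/today": [1, 0, 1, 0, 1, 1],
}

rules = mine_negative_rules(history)
skip = dirs_to_skip(rules, "/news")  # contains only "/news/archive"
```

In this toy history, `/news/archive` changes exactly when `/news` does not, so it is the only subdirectory flagged for skipping; `/news/today` tends to change together with `/news` and is still crawled.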