A quantitative evaluation of techniques for detection of abnormal change events in blogs.

  • Authors: Paul L. Bogen; Richard Furuta; Frank Shipman
  • Affiliations: Oak Ridge National Laboratory, Knoxville, TN, USA; Texas A&M University, College Station, TX, USA; Texas A&M University, College Station, TX, USA
  • Venue: Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries
  • Year: 2012

Abstract

While most digital collections have limited forms of change (primarily the creation and deletion of resources), there exists a class of digital collections that undergoes additional kinds of change. These collections are made up of resources that are distributed across the Internet and brought together into a collection via hyperlinking. Resources in these collections can be expected to change as time goes on. Part of the difficulty in maintaining these collections is determining whether a changed page is still a valid member of the collection. Others have tried to address this problem by measuring change and defining a maximum allowed threshold of change; however, these methods treat all change as a potential problem and treat web content as a static document despite its intrinsically dynamic nature. Instead, we treat change on the web as a normal part of a web document's life-cycle and determine the difference between what a maintainer expects a page to do and what it actually does. In this work we evaluate the different options for extractors and analyzers in order to determine the best options from a suite of techniques. The evaluation used a human-generated ground-truth set of blog changes. The results showed a statistically significant improvement over a range of traditional threshold techniques when applied to our collection of tagged blog changes.
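
The contrast the abstract draws, between a fixed maximum-change threshold and a comparison against what a page is expected to do, can be illustrated with a minimal sketch. This is not the authors' extractors or analyzers, which the abstract does not describe. It assumes a hypothetical change magnitude between 0 and 1 for each visit to a page and uses a simple z-score against the page's own history as a stand-in for "expected behavior". The function names, the 0.5 threshold, and the cutoff of 3 standard deviations are all illustrative assumptions.

```python
from statistics import mean, stdev


def fixed_threshold_flags(changes, max_change=0.5):
    """Threshold style: flag any change above one global limit (assumed 0.5)."""
    return [c > max_change for c in changes]


def expectation_flags(changes, min_history=3, z_cutoff=3.0):
    """Flag a change only if it departs from the page's own past behavior.

    Assumption: "expected behavior" is modeled as the mean and standard
    deviation of earlier change magnitudes. The first `min_history`
    observations only build the baseline and are never flagged.
    """
    flags = []
    for i, c in enumerate(changes):
        history = changes[:i]
        if len(history) < min_history:
            flags.append(False)
            continue
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly regular history: any departure is unexpected.
            flags.append(c != mu)
        else:
            flags.append(abs(c - mu) / sigma > z_cutoff)
    return flags


# Hypothetical change magnitudes (0 = identical, 1 = completely different)
# for a blog whose front page normally changes a lot between visits,
# followed by one visit where it barely changes at all.
busy_blog = [0.62, 0.58, 0.65, 0.60, 0.61, 0.02]

print(fixed_threshold_flags(busy_blog))  # flags the five routine updates
print(expectation_flags(busy_blog))      # flags only the final, atypical visit
```

On this made-up series the fixed threshold reports every routine update as a problem and misses the visit where the blog suddenly stops changing, while the history-based check flags only that last visit. Real analyzers would need a richer model of page behavior than a single mean and standard deviation.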