Capabilities of outlier detection schemes in large datasets, framework and methodologies

  • Authors:
  • Jian Tang; Zhixiang Chen; Ada Waichee Fu; David W. Cheung

  • Affiliations:
  • Memorial University of Newfoundland, Department of Computer Science, St. John's, Newfoundland, Canada; University of Texas-Pan American, Department of Computer Science, Edinburg, Texas, USA; Chinese University of Hong Kong, Department of Computer Science and Engineering, Shatin, Hong Kong; University of Hong Kong, Department of Computer Science and Information Systems, Pokfulam, Hong Kong

  • Venue:
  • Knowledge and Information Systems
  • Year:
  • 2006

Abstract

Outlier detection is concerned with discovering exceptional behaviors of objects. Its theoretical principles and practical implementation lay a foundation for important applications such as credit card fraud detection, discovering criminal behaviors in e-commerce, and discovering computer intrusion. In this paper, we first present a unified model for several existing outlier detection schemes, and propose a compatibility theory, which establishes a framework for describing the capabilities of various outlier formulation schemes in terms of how well they match users' intuitions. Under this framework, we show that the density-based scheme is more powerful than the distance-based scheme when a dataset contains patterns with diverse characteristics. The density-based scheme, however, is less effective when the patterns are of comparable density to the outliers. We then introduce a connectivity-based scheme that improves the effectiveness of the density-based scheme when a pattern itself is of similar density to an outlier. We compare the density-based and connectivity-based schemes in terms of their strengths and weaknesses, and demonstrate applications with different features in which each is more effective than the other. Finally, the connectivity-based and density-based schemes are comparatively evaluated on both real-life and synthetic datasets in terms of recall, precision, rank power and implementation-free metrics.
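
The abstract names recall, precision and rank power as evaluation metrics but does not define them. The following is a minimal sketch of how such metrics might be computed for the top-m candidates of an outlier ranking, not the authors' implementation. The function name `evaluate_top_m`, its parameters and the toy data are hypothetical, and the rank power formula n(n+1) / (2 * sum of ranks of the true outliers found) is an assumption that should be checked against the paper's own definition.

```python
def evaluate_top_m(ranked_ids, true_outliers, m):
    """Score the top-m candidates of an outlier ranking against known outliers.

    ranked_ids: object ids ordered from most to least outlying (hypothetical input).
    true_outliers: set of ids regarded as genuine outliers.
    """
    top = ranked_ids[:m]
    # 1-based ranks of the true outliers that appear among the top m
    hit_ranks = [i + 1 for i, obj in enumerate(top) if obj in true_outliers]
    n = len(hit_ranks)
    # Precision: fraction of returned candidates that are true outliers
    precision = n / len(top) if top else 0.0
    # Recall: fraction of all true outliers that were returned
    recall = n / len(true_outliers) if true_outliers else 0.0
    # ASSUMED rank power definition: n(n+1) / (2 * sum of hit ranks).
    # It equals 1.0 only when the n hits occupy ranks 1..n.
    rank_power = n * (n + 1) / (2 * sum(hit_ranks)) if n else 0.0
    return precision, recall, rank_power


# Toy data, invented for illustration only
ranked = ["a", "b", "c", "d", "e"]
truth = {"a", "c", "z"}
precision, recall, rank_power = evaluate_top_m(ranked, truth, m=4)
print(precision, recall, rank_power)
```

Under this assumed definition, rank power rewards rankings that place the true outliers near the top, which precision and recall alone do not capture: two rankings with the same hits in the top m can differ in rank power.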