Fast Distributed Outlier Detection in Mixed-Attribute Data Sets

  • Authors:
  • Matthew Eric Otey; Amol Ghoting; Srinivasan Parthasarathy

  • Affiliations:
  • Department of Computer Science and Engineering, The Ohio State University, Columbus, USA 43210; Department of Computer Science and Engineering, The Ohio State University, Columbus, USA 43210; Department of Computer Science and Engineering, The Ohio State University, Columbus, USA 43210

  • Venue:
  • Data Mining and Knowledge Discovery
  • Year:
  • 2006

Abstract

Efficiently detecting outliers or anomalies is an important problem in many areas of science, medicine and information technology. Applications range from data cleaning to clinical diagnosis, from detecting anomalous defects in materials to fraud and intrusion detection. Over the past decade, researchers in data mining and statistics have addressed the problem of outlier detection using both parametric and non-parametric approaches in a centralized setting. However, there are still several challenges that must be addressed. First, most approaches to date have focused on detecting outliers in a continuous attribute space. However, almost all real-world data sets contain a mixture of categorical and continuous attributes. Categorical attributes are typically ignored or incorrectly modeled by existing approaches, resulting in a significant loss of information. Second, there have not been any general-purpose distributed outlier detection algorithms. Most distributed detection algorithms are designed with a specific domain (e.g. sensor networks) in mind. Third, the data sets being analyzed may be streaming or otherwise dynamic in nature. Such data sets are prone to concept drift, and models of the data must be dynamic as well. To address these challenges, we present a tunable algorithm for distributed outlier detection in dynamic mixed-attribute data sets.