Mining Outliers in Correlated Subspaces for High Dimensional Data Sets

Authors:
Jinsong Leng; Tzung-Pei Hong
Affiliations:
School of Computer and Security Science, Edith Cowan University, WA 6050, Australia. E-mail: J.Leng@ecu.edu.au; Department of Computer Science and Information Engineering, National University of Kaohsiung, Taiwan. E-mail: tphong@nuk.edu.tw
Venue:
Fundamenta Informaticae - Intelligent Data Analysis in Granular Computing
Year:
2010



Abstract

Outlier detection in high dimensional data sets is a challenging data mining task. Mining outliers in subspaces is a promising solution, because outliers may be embedded in some interesting subspaces. Searching all possible subspaces, however, is infeasible: the number of candidate subspaces grows exponentially with the number of dimensions, a manifestation of the "curse of dimensionality". Because high dimensional data sets contain many irrelevant dimensions, it is essential to eliminate the irrelevant or unimportant dimensions and identify interesting subspaces with strong correlation. Normally, the correlation among dimensions can be determined by traditional feature selection techniques or subspace-based clustering methods. Dimension-growth subspace clustering techniques find interesting subspaces in relatively low dimensional spaces, while dimension-reduction approaches try to group interesting subspaces with larger dimensions. This paper investigates the possibility of detecting outliers in correlated subspaces and presents a novel approach for doing so. The degree of correlation among dimensions is measured in terms of the mean squared residue, and a dimension-reduction method is employed to find the correlated subspaces. Based on the correlated subspaces obtained, we introduce another criterion called "shape factor" to rank the most important of the projected subspaces. Finally, outliers are identified within the most important subspaces by using classical outlier detection techniques. Empirical studies show that the proposed approach can identify outliers effectively in high dimensional data sets.
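
The abstract names the mean squared residue as the correlation measure but does not define it. The sketch below assumes the definition commonly used in the biclustering literature: for a submatrix with row set I and column set J, each entry's residue is the entry minus its row mean minus its column mean plus the overall mean, and the score is the average of the squared residues. The function name `mean_squared_residue` and its signature are hypothetical, and the paper's exact formulation may differ. A low score suggests the selected dimensions vary coherently across the selected points.

```python
from typing import Sequence


def mean_squared_residue(
    data: Sequence[Sequence[float]],
    rows: Sequence[int],
    cols: Sequence[int],
) -> float:
    """Mean squared residue of the submatrix data[rows][cols].

    Assumed definition (standard in biclustering, not taken from the paper):
    residue(i, j) = a_ij - row_mean_i - col_mean_j + overall_mean,
    score = average of residue(i, j) ** 2 over the submatrix.
    """
    if not rows or not cols:
        raise ValueError("rows and cols must be non-empty")

    # Extract the submatrix spanned by the chosen points and dimensions.
    sub = [[data[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(rows), len(cols)

    row_means = [sum(r) / n_cols for r in sub]
    col_means = [
        sum(sub[i][j] for i in range(n_rows)) / n_rows for j in range(n_cols)
    ]
    overall_mean = sum(row_means) / n_rows

    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = sub[i][j] - row_means[i] - col_means[j] + overall_mean
            total += residue * residue
    return total / (n_rows * n_cols)


if __name__ == "__main__":
    # Rows that are shifted copies of each other are perfectly coherent.
    coherent = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [4.0, 5.0, 6.0]]
    print(mean_squared_residue(coherent, [0, 1, 2], [0, 1, 2]))

    # A checkerboard pattern has no additive structure.
    incoherent = [[1.0, 0.0], [0.0, 1.0]]
    print(mean_squared_residue(incoherent, [0, 1], [0, 1]))
```

Under this definition, a submatrix whose rows differ only by a constant offset scores (numerically close to) zero, while the 2×2 checkerboard scores 0.25. The "shape factor" criterion is not defined in the abstract, so it is not sketched here.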