Resource-bounded Outlier Detection using Clustering Methods

Authors:
Luis Torgo;Carlos Soares
Affiliations:
LIAAD/INESC Porto LA, Universidade do Porto, Portugal and Faculdade de Ciências, Universidade do Porto, Portugal;LIAAD/INESC Porto LA, Universidade do Porto, Portugal and Faculdade de Economia, Universidade do Porto, Portugal
Venue:
Proceedings of the 2010 conference on Data Mining for Business Applications
Year:
2010

Abstract

This paper describes a methodology for applying hierarchical clustering methods to the task of outlier detection. The methodology is tested on the problem of cleaning Official Statistics data. The goal is to detect erroneous foreign trade transactions in data collected by the Portuguese Institute of Statistics (INE). These transactions are a minority, but they still have an important impact on the statistics produced by the institute. The detection of these rare errors is a manual, time-consuming task. This type of task is usually constrained by a limited amount of available resources. Our proposal addresses this issue by producing a ranking of outlyingness that allows better management of the available resources by allocating them to the cases which are most different from the others and, thus, have a higher probability of being errors. Our method is based on the output of standard agglomerative hierarchical clustering algorithms, resulting in no significant additional computational costs. Our results show that it enables large savings by selecting a small subset of suspicious transactions for manual inspection, which nevertheless includes most of the erroneous transactions. In this study we compare our proposal to a state-of-the-art outlier ranking method (LOF) and show that our method achieves better results on this particular application. The results of our experiments are also competitive with previous results on the same data. Finally, the outcome of our experiments raises important questions about the method currently followed at INE for items with a small number of transactions.
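
The abstract does not state how the outlyingness ranking is derived from the clustering output, so the following is only a minimal sketch under stated assumptions. It uses SciPy's `linkage` for the agglomerative clustering and a hypothetical scoring rule: when a small group is merged into a much larger one, the members of the small group receive a score reflecting the size imbalance. The function name `outlier_ranking` and the scoring formula are illustrative choices, not taken from the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage


def outlier_ranking(X, method="average"):
    """Rank the rows of X by an outlyingness score read off the merge tree.

    Hypothetical scoring rule (an assumption, not stated in the abstract):
    at each merge, members of the smaller group get the score
    (|large| - |small|) / (|large| + |small|), and every observation
    keeps the maximum score it receives over all merges.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    Z = linkage(X, method=method)  # standard agglomerative clustering

    # Map each cluster id to the indices of the observations it contains.
    members = {i: [i] for i in range(n)}
    scores = np.zeros(n)

    for step, (a, b, _dist, _size) in enumerate(Z):
        ga, gb = members.pop(int(a)), members.pop(int(b))
        small, large = (ga, gb) if len(ga) <= len(gb) else (gb, ga)
        s = (len(large) - len(small)) / (len(large) + len(small))
        scores[small] = np.maximum(scores[small], s)
        members[n + step] = ga + gb  # SciPy numbers new clusters n, n+1, ...

    # Most suspicious observations first; a stable sort keeps ties in input order.
    ranking = np.argsort(-scores, kind="stable")
    return ranking, scores
```

Under a fixed inspection budget, one would hand only the first `k` entries of `ranking` to the human reviewers. The scores are a by-product of a single pass over the merge history, which is consistent with the abstract's claim of no significant additional computational cost.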