Minority report in fraud detection: classification of skewed data

Authors:
Clifton Phua; Damminda Alahakoon; Vincent Lee
Affiliations:
Monash University, Clayton, Victoria, Australia; Monash University, Clayton, Victoria, Australia; Monash University, Clayton, Victoria, Australia
Venue:
ACM SIGKDD Explorations Newsletter - Special issue on learning from imbalanced datasets
Year:
2004



Abstract

This paper proposes an innovative fraud detection method, built upon existing fraud detection research and Minority Report, to deal with the data mining problem of skewed data distributions. The method applies backpropagation (BP), naive Bayesian (NB), and C4.5 algorithms to data partitions derived from minority oversampling with replacement. Its originality lies in using a single meta-classifier (stacking) to choose the best base classifiers and then combining these base classifiers' predictions (bagging) to improve cost savings (stacking-bagging). Results on a publicly available automobile insurance fraud detection data set show that, in terms of cost savings, stacking-bagging performs slightly better than both the best-performing bagged algorithm, C4.5, and its best classifier, C4.5 (2). Stacking-bagging also outperforms the technique commonly used in industry (BP without either sampling or partitioning). The paper then compares the new fraud detection method (the meta-learning approach) against C4.5 trained using undersampling, oversampling, and SMOTE without partitioning (the sampling approach). Results show that, given a fixed decision threshold and cost matrix, the approach based on partitioning and multiple algorithms achieves marginally higher cost savings than varying the class distribution of the entire training data set. The most interesting finding is that the combination of classifiers producing the best cost savings draws on all three algorithms.
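
To make the data preparation and combination steps in the abstract more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that each data partition pairs a disjoint slice of the majority (legitimate) class with minority (fraud) examples drawn with replacement until the two classes are balanced, and that the bagging step combines base classifier outputs by simple majority vote. The function names `make_partitions` and `majority_vote`, the 50:50 balance, and the toy data are all hypothetical. The stacking step, in which a meta-classifier selects the best base classifiers, is not shown.

```python
import random
from collections import Counter


def make_partitions(majority, minority, n_partitions, seed=0):
    """Split the majority class into disjoint chunks and pair each chunk with
    minority examples drawn with replacement until the classes are balanced.

    The 50:50 balance is an assumption made for this sketch."""
    rng = random.Random(seed)
    shuffled = majority[:]
    rng.shuffle(shuffled)
    partitions = []
    for i in range(n_partitions):
        chunk = shuffled[i::n_partitions]  # every n-th example, disjoint across i
        oversampled = rng.choices(minority, k=len(chunk))  # sampling with replacement
        partitions.append(chunk + oversampled)
    return partitions


def majority_vote(predictions):
    """Bagging-style combination: return the most common label per example.

    `predictions` is a list of per-classifier label lists of equal length."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


# Toy data (hypothetical): 20 legitimate claims and 3 fraudulent ones.
legit = [("legit", i) for i in range(20)]
fraud = [("fraud", i) for i in range(3)]
parts = make_partitions(legit, fraud, n_partitions=4)

# Three hypothetical base classifiers voting on three examples (1 = fraud).
combined = majority_vote([[1, 0, 1], [1, 1, 0], [0, 1, 1]])
```

In this sketch every partition sees the same small pool of fraud examples but a different slice of the legitimate ones, so base classifiers trained on different partitions differ mainly in the majority-class data they observe.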