Simple instance selection for bankruptcy prediction

Authors:
Chih-Fong Tsai; Kai-Chun Cheng
Affiliations:
Department of Information Management, National Central University, Taiwan; Department of Information Management, National Central University, Taiwan
Venue:
Knowledge-Based Systems
Year:
2012

Abstract

Instance selection, or outlier detection, is an important task in data mining that focuses on filtering out bad data from a given dataset. However, there is no rigid mathematical definition of what constitutes an outlier, and being an outlier is not a binary property. Therefore, different volumes of outliers may be detected depending on the threshold chosen for what constitutes an outlier, e.g., the distance in distance-based outlier detection. In this study, we examine the bankruptcy prediction performance achieved after removing different outlier volumes from four widely used datasets, namely the Australian, German, Japanese, and UC Competition datasets. Specifically, a simple distance-based clustering outlier detection method is used. In addition, four popular classification techniques are compared: artificial neural networks, decision trees, logistic regression, and support vector machines. Experiments are conducted to examine (1) the prediction performance of the bankruptcy prediction models with and without instance selection, (2) the stability of the bankruptcy prediction models after the removal of outliers from the testing set, and (3) the characteristics of the four datasets. The results show that with the German dataset it is much more difficult for the prediction models to achieve high accuracy after outlier removal, while it is easier with the UC Competition dataset. Removing 50% of the outliers can lead to optimal performance of the four models. In addition, when the removed outliers are used to test the prediction accuracy of the models, we find that support vector machines (SVM) provide the highest prediction accuracy and are much more stable and noise-tolerant than the other three prediction models. Furthermore, the prediction accuracy of the SVM model with instance selection is similar to that of the SVM model without instance selection (i.e., the SVM baseline). In other words, the difference in performance between the SVM and its baseline is smaller than the corresponding differences for the other three models.
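The abstract does not spell out the detection procedure, so the following is only a minimal sketch of one possible distance-based reading, not the authors' method: each class is treated as a single cluster, instances are ranked by Euclidean distance to their class centroid, and a chosen fraction of the farthest instances is removed. The function name `remove_farthest` and the `fraction` parameter are hypothetical, introduced here purely for illustration.

```python
import numpy as np


def remove_farthest(X, y, fraction):
    """Split (X, y) into kept and removed instances.

    Assumption (not from the paper): each class is one cluster, and the
    `fraction` of instances farthest from the class centroid are treated
    as outliers.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    keep = np.ones(len(X), dtype=bool)

    for label in np.unique(y):
        idx = np.flatnonzero(y == label)           # rows belonging to this class
        centroid = X[idx].mean(axis=0)             # class centroid
        dist = np.linalg.norm(X[idx] - centroid, axis=1)
        n_drop = int(np.floor(fraction * len(idx)))
        if n_drop > 0:
            farthest = idx[np.argsort(dist)[-n_drop:]]
            keep[farthest] = False                 # mark the farthest rows as outliers

    # Kept instances can train a model; removed ones can serve as a noisy test set.
    return X[keep], y[keep], X[~keep], y[~keep]


# Toy data: one obvious outlier per class.
X_demo = np.array([[0, 0], [0, 1], [1, 0], [10, 10],
                   [5, 5], [5, 6], [6, 5], [-20, -20]])
y_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X_kept, y_kept, X_out, y_out = remove_farthest(X_demo, y_demo, fraction=0.25)
```

Varying `fraction` changes how many instances are discarded, which mirrors the idea of removing different outlier volumes; returning the removed instances separately mirrors the idea of testing a trained model on them to gauge its noise tolerance.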