Instance selection, or outlier detection, is an important task in data mining that focuses on filtering bad data out of a given dataset. However, there is no rigid mathematical definition of what constitutes an outlier, and being an outlier is not a binary property. Consequently, different volumes of outliers may be detected depending on the threshold chosen to define an outlier, e.g., the distance threshold in distance-based outlier detection. In this study, we examine the bankruptcy prediction performance achieved after removing different volumes of outliers from four widely used datasets, namely the Australian, German, Japanese, and UC Competition datasets. Specifically, a simple distance-based clustering method is used for outlier detection. In addition, four popular classification techniques are compared: artificial neural networks, decision trees, logistic regression, and support vector machines. Experiments are conducted to examine (1) the prediction performance of the bankruptcy prediction models with and without instance selection, (2) the stability of the bankruptcy prediction models after the removal of outliers from the testing set, and (3) the characteristics of the four datasets. The results show that after outlier removal it is much more difficult for the prediction models to achieve high accuracy on the German dataset, whereas it is easier on the UC Competition dataset. Removing 50% of the outliers leads to the optimal performance of the four models. In addition, when the removed outliers are used to test the prediction accuracy of these models, support vector machines (SVM) provide the highest accuracy and exhibit much greater stability and noise tolerance than the other three prediction models. Furthermore, the prediction accuracy of the SVM model with instance selection is similar to that without instance selection (i.e., the SVM baseline).
In other words, the performance difference between the SVM and the SVM baseline is the smallest among the models compared against their corresponding baselines.
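The distance-based instance-selection idea described above can be sketched as follows. This is a minimal illustration, not the paper's exact method: it assumes outliers are ranked by Euclidean distance to their class centroid (one cluster per class) and that a chosen fraction of the farthest instances is removed. The function name and the centroid-per-class choice are illustrative assumptions.

```python
import numpy as np

def remove_outliers(X, y, frac):
    """Drop the `frac` fraction of instances farthest from their class
    centroid (a simple distance-based scheme; illustrative only).

    Returns the filtered (X, y) and the indices of removed instances."""
    dists = np.empty(len(X))
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        centroid = X[idx].mean(axis=0)          # cluster center per class
        dists[idx] = np.linalg.norm(X[idx] - centroid, axis=1)
    # Keep the instances closest to their centroid.
    keep = np.argsort(dists)[: int(round(len(X) * (1 - frac)))]
    removed = np.setdiff1d(np.arange(len(X)), keep)
    return X[keep], y[keep], removed

# Toy example: one point far from the rest is flagged as the outlier.
X = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [0.0, 0.1]])
y = np.zeros(4)
X_sel, y_sel, removed = remove_outliers(X, y, frac=0.25)
```

In the experimental setup described above, the models would be trained on `(X_sel, y_sel)` and the `removed` instances could then serve as a noisy testing set for the stability comparison.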