A comparative analysis of classification algorithms in data mining for accuracy, speed and robustness

  • Authors: Neslihan Dogan; Zuhal Tanrikulu
  • Affiliations: PricewaterhouseCoopers LLP, London, UK; Department of Management Information Systems, Bogazici University, Istanbul, Turkey
  • Venue: Information Technology and Management
  • Year: 2013

Abstract

Classification algorithms are among the most commonly used data mining models, widely applied to extract valuable knowledge from huge amounts of data. The criteria used to evaluate classifiers are typically accuracy, computational complexity, robustness, scalability, integration, comprehensibility, stability, and interestingness. This study compares classification algorithms' accuracy, speed (CPU time consumed), and robustness across various datasets and implementation techniques. The data miner selects a model mainly with respect to classification accuracy; therefore, the performance of each classifier plays a crucial role in selection. Complexity is mostly dominated by the time required for classification; here, it is measured as the CPU time consumed by each classifier. The study first discusses the application of certain classification models to multiple datasets in three stages: first, implementing the algorithms on the original datasets; second, implementing them on the same datasets with continuous variables discretised; and third, implementing them on the same datasets after principal component analysis is applied. The resulting accuracies and speeds are then compared. The relationships between dataset characteristics, implementation attributes, and both accuracy and CPU time are also examined and discussed. Moreover, a regression model is introduced to quantify the effect of dataset and implementation conditions on classifier accuracy and CPU time. Finally, the study addresses the robustness of the classifiers, measured by repetitive experiments on both noisy and cleaned datasets.
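The staged evaluation protocol the abstract describes (run classifiers on raw data, then on discretised data, recording accuracy and CPU time) can be sketched in miniature. The snippet below is a hypothetical pure-Python illustration, not the paper's actual experiment: it compares a 3-NN classifier against a majority-class baseline on a small synthetic dataset, once on raw continuous features and once after equal-width discretisation, measuring accuracy and CPU time via `time.process_time` (the PCA stage is omitted for brevity).

```python
import random
import time

def make_data(n=200, seed=0):
    # Synthetic two-class data: class 0 clustered near (0, 0), class 1 near (3, 3).
    rng = random.Random(seed)
    return [([rng.gauss(3.0 * label, 1.0), rng.gauss(3.0 * label, 1.0)], label)
            for label in (rng.randint(0, 1) for _ in range(n))]

def discretise(data, bins=4, lo=-3.0, hi=6.0):
    # Equal-width binning of each continuous feature (stage two of the protocol).
    width = (hi - lo) / bins
    return [([min(bins - 1, max(0, int((v - lo) // width))) for v in x], y)
            for x, y in data]

def knn_predict(train, x, k=3):
    # Plain k-nearest-neighbours by squared Euclidean distance, majority vote.
    nearest = sorted(train, key=lambda t: sum((a - b) ** 2 for a, b in zip(t[0], x)))
    votes = [y for _, y in nearest[:k]]
    return max(set(votes), key=votes.count)

def majority_predict(train, x):
    # Baseline: always predict the most frequent training label.
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)

def evaluate(predict, train, test):
    # CPU time, matching the study's speed metric, rather than wall-clock time.
    start = time.process_time()
    correct = sum(predict(train, x) == y for x, y in test)
    return correct / len(test), time.process_time() - start

data = make_data()
train, test = data[:150], data[150:]
for stage, tr, te in [("raw", train, test),
                      ("discretised", discretise(train), discretise(test))]:
    for name, clf in [("3-NN", knn_predict), ("majority", majority_predict)]:
        acc, cpu = evaluate(clf, tr, te)
        print(f"{stage:12s} {name:9s} accuracy={acc:.2f} cpu={cpu:.4f}s")
```

The same loop structure extends naturally to more classifiers, more datasets, and a third PCA-transformed stage, which is essentially the grid of conditions the study sweeps over.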