Effects of data set features on the performances of classification algorithms

Authors:
Ohbyung Kwon; Jae Mun Sim
Affiliations:
School of Management, Kyung Hee University, 26 Kyunghee-daero, Dongdaemun-gu, Seoul 130-701, Republic of Korea; Department of International Management, Kyung Hee University, 26 Kyunghee-daero, Dongdaemun-gu, Seoul 130-701, Republic of Korea
Venue:
Expert Systems with Applications: An International Journal
Year:
2013

Abstract

As the need to analyze big data sets grows, classification algorithms play an increasingly important role in data mining. Big data analysis must account for more characteristics of the data sets, such as data structure, variety of sources, and update frequency. In this paper, we evaluate scenarios that examine which data set characteristics most affect the performance of classification algorithms. Determining how strong or weak a given algorithm is on a given data set remains a complex issue. Our research therefore examines experimentally how data set characteristics affect algorithm performance, in terms of both accuracy and elapsed time. To do so, we use multiple regression to evaluate the causal relationship between data set characteristics, as independent variables, and performance metrics, as dependent variables. We also examine the role that the classification algorithm plays as a moderator of this relationship. The experiment uses all benchmark data sets in the UCI database that are suitable for running the classification algorithms. Based on the experimental results, we discuss what legacy classification algorithms require in order to address big data analysis in a new era of business intelligence.
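
The regression design described in the abstract, with a data set characteristic as independent variable, a performance metric as dependent variable, and the algorithm as moderator, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the single characteristic `n_features`, the two-algorithm dummy variable `algo`, the coefficient values, and the helper name `fit_moderated_regression`. The data are synthetic and noise-free, so the fit simply recovers the coefficients that generated them. Moderation is modeled in the usual way, as an interaction term between the characteristic and the algorithm dummy.

```python
import numpy as np


def fit_moderated_regression(x, moderator, y):
    """Fit y = b0 + b1*x + b2*m + b3*(x*m) by ordinary least squares.

    b3 is the moderation effect: how much the slope of x changes
    when the moderator m switches from 0 to 1.
    """
    x = np.asarray(x, dtype=float)
    m = np.asarray(moderator, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix: intercept, two main effects, and the interaction term.
    design = np.column_stack([np.ones_like(x), x, m, x * m])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


# Synthetic, made-up data (hypothetical; NOT results from the paper).
rng = np.random.default_rng(0)
n = 200
n_features = rng.uniform(5, 100, size=n)  # hypothetical data set characteristic
algo = rng.integers(0, 2, size=n)         # dummy code: 0 = algorithm A, 1 = algorithm B

# Hypothetical generating coefficients: intercept, characteristic,
# algorithm, and characteristic-by-algorithm interaction.
true_coef = np.array([0.90, -0.002, 0.03, 0.001])
accuracy = (
    true_coef[0]
    + true_coef[1] * n_features
    + true_coef[2] * algo
    + true_coef[3] * n_features * algo
)

coef = fit_moderated_regression(n_features, algo, accuracy)
for name, value in zip(["intercept", "n_features", "algo", "n_features:algo"], coef):
    print(f"{name:>16}: {value:.4f}")
```

A nonzero interaction coefficient means that the effect of the characteristic on accuracy differs between the two algorithms, which is what a moderating role amounts to in this kind of model. With more than two algorithms, one dummy variable and one interaction term per non-reference algorithm would be needed.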