On the dataset shift problem in software engineering prediction models

Authors: Burak Turhan
Affiliations: Department of Information Processing Science, University of Oulu, Oulu, Finland 90014
Venue: Empirical Software Engineering
Year: 2012

Abstract

A core assumption of any prediction model is that the test data distribution does not differ from the training data distribution. Prediction models used in software engineering are no exception. In reality, this assumption can be violated in many ways, resulting in inconsistent and non-transferable observations across different cases. The goal of this paper is to explain the phenomenon of conclusion instability through the dataset shift concept, from the perspective of software effort and fault prediction. Different types of dataset shift are explained with examples from software engineering, and techniques for addressing the associated problems are discussed. While dataset shifts in the form of sample selection bias and imbalanced data are well known in software engineering research, understanding the other types is relevant for interpreting non-transferable results across different sites and studies. The software engineering community should be aware of, and account for, dataset-shift-related issues when evaluating the validity of research outcomes.
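
The assumption that training and test data share a distribution can be checked, at least roughly, one feature at a time. The sketch below is illustrative only and is not taken from the paper: it uses the two-sample Kolmogorov-Smirnov statistic as one possible shift indicator, and the function names, the "module size" feature, and the synthetic log-normal data are all assumptions made for the example.

```python
import random


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    empirical CDFs of the two samples (0 = identical, 1 = fully separated)."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    max_gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x, so ties are handled.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        max_gap = max(max_gap, abs(i / len(a) - j / len(b)))
    return max_gap


def demo(seed=0, n=500):
    """Compare a training sample against an unshifted and a shifted test sample.

    All data here is synthetic (hypothetical module sizes), not real project data.
    """
    rng = random.Random(seed)
    train = [rng.lognormvariate(5.0, 1.0) for _ in range(n)]
    test_same = [rng.lognormvariate(5.0, 1.0) for _ in range(n)]
    test_shifted = [rng.lognormvariate(6.5, 1.0) for _ in range(n)]  # larger modules
    return ks_statistic(train, test_same), ks_statistic(train, test_shifted)


if __name__ == "__main__":
    same, shifted = demo()
    print(f"KS vs. same-distribution test set: {same:.3f}")
    print(f"KS vs. shifted test set:           {shifted:.3f}")
```

A statistic near 0 suggests the feature is distributed similarly in both sets, while a large value suggests the test data may come from a different population, for example another company or project. A per-feature check like this only looks at the inputs; it says nothing about whether the relationship between inputs and the predicted outcome has changed.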