Assessing the quality and cleaning of a software project dataset: an experience report

  • Authors:
  • Gernot Liebchen, Bheki Twala, Martin Shepperd, Michelle Cartwright

  • Affiliations:
  • Brunel University, UK (all four authors)

  • Venue:
  • EASE'06: Proceedings of the 10th International Conference on Evaluation and Assessment in Software Engineering
  • Year:
  • 2006

Abstract

OBJECTIVE - The aim is to report on an assessment of the impact noise has on predictive accuracy, by comparing noise handling techniques.

METHOD - We describe the process of cleaning a large software management dataset initially comprising more than 10,000 projects. Data quality is assessed mainly through feedback from the data provider and manual inspection of the data. Three methods of noise correction (polishing, noise elimination, and robust algorithms) are compared with respect to their accuracy. Noise detection was undertaken using a regression tree model.

RESULTS - The three noise correction methods yielded different levels of accuracy.

CONCLUSIONS - The results demonstrate that polishing improves classification accuracy compared to the noise elimination and robust algorithm approaches.
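
To make the noise handling steps concrete, below is a minimal sketch in Python of regression-tree-based noise detection followed by two of the compared strategies (noise elimination and polishing). It assumes scikit-learn and a synthetic stand-in dataset; the features, residual threshold, and model settings are illustrative assumptions, not the paper's actual data or parameters, and the paper's own procedure may differ in detail (e.g. using cross-validated predictions rather than in-sample residuals).

    # Sketch: detect noisy instances via regression tree residuals, then
    # either eliminate them or "polish" them (replace suspect target values
    # with the model's prediction). All data and thresholds are synthetic.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)

    # Synthetic stand-in for a software project dataset:
    # two features (e.g. size, team size) and a target (e.g. effort).
    X = rng.uniform(1, 100, size=(500, 2))
    y = 3.0 * X[:, 0] + 5.0 * X[:, 1] + rng.normal(0, 10, size=500)
    y[:25] += rng.normal(0, 200, size=25)   # inject 25 noisy instances

    tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)
    residuals = np.abs(y - tree.predict(X))

    # Flag instances with unusually large residuals as noisy.
    # (The 3-sigma cutoff here is an assumption for illustration.)
    noisy = residuals > 3 * residuals.std()

    # Noise elimination: drop the flagged instances entirely.
    X_elim, y_elim = X[~noisy], y[~noisy]

    # Polishing: keep the flagged instances but correct their target
    # values using the model's prediction instead of discarding them.
    y_polished = y.copy()
    y_polished[noisy] = tree.predict(X[noisy])

    print(f"flagged {noisy.sum()} of {len(y)} instances as noisy")

The third approach compared in the paper, robust algorithms, would instead leave the data untouched and rely on a learner that tolerates noise during training (for example, a heavily pruned tree).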