Data mining source code for locating software bugs: A case study in telecommunication industry

Authors:
Burak Turhan; Gozde Kocak; Ayse Bener
Affiliations:
Dept. of Computer Engineering, Bogazici University, 34342 Istanbul, Turkey (all authors)
Venue:
Expert Systems with Applications: An International Journal
Year:
2009

Abstract

In a large software system, knowing which files are most likely to be fault-prone is valuable information for project managers, who can use it to prioritize software testing and allocate resources accordingly. However, our experience shows that it is difficult to collect and analyze fine-grained test defects in a large and complex software system. On the other hand, previous research has shown that companies unable to collect local data can safely use cross-company data with nearest neighbor sampling to predict their defects. In this study, we analyzed 25 projects of a large telecommunication system. To predict the defect proneness of modules, we trained models on publicly available NASA MDP data. In our experiments, we used static call graph based ranking (CGBR) as well as nearest neighbor sampling to construct method-level defect predictors. Our results suggest that, for the analyzed projects, at least 70% of the defects can be detected by inspecting only (i) 6% of the code using a Naive Bayes model, or (ii) 3% of the code using the CGBR framework.
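
The combination of nearest neighbor sampling and a Naive Bayes predictor mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the value of `k`, the Euclidean distance, the Gaussian likelihood, and the two toy metrics per method are all assumptions made for illustration, and the CGBR step is not covered.

```python
import math
from collections import defaultdict

def euclidean(a, b):
    """Euclidean distance between two metric vectors (assumed distance measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nn_sample(cross_rows, local_rows, k=3):
    """Keep each cross-company row that is among the k nearest
    neighbours of at least one (unlabeled) local row. k is hypothetical."""
    selected = set()
    for target in local_rows:
        ranked = sorted(range(len(cross_rows)),
                        key=lambda i: euclidean(cross_rows[i][0], target))
        selected.update(ranked[:k])
    return [cross_rows[i] for i in sorted(selected)]

def train_gaussian_nb(rows, min_var=1e-9):
    """Estimate a log prior plus per-feature mean and variance for each class."""
    by_label = defaultdict(list)
    for features, label in rows:
        by_label[label].append(features)
    model = {}
    for label, feats in by_label.items():
        n = len(feats)
        columns = list(zip(*feats))
        means = [sum(col) / n for col in columns]
        variances = [max(sum((x - m) ** 2 for x in col) / n, min_var)
                     for col, m in zip(columns, means)]
        model[label] = (math.log(n / len(rows)), means, variances)
    return model

def predict(model, features):
    """Return the class with the highest log posterior."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
            for x, m, v in zip(features, means, variances))
        return log_prior + log_likelihood
    return max(model, key=log_posterior)

# Toy cross-company data: ([lines_of_code, cyclomatic_complexity], defective?)
cross_company = [
    ([10, 2], 0), ([12, 3], 0), ([11, 2], 0),
    ([80, 15], 1), ([90, 18], 1), ([85, 20], 1),
    ([5000, 400], 1),  # unlike any local method, so sampling should drop it
]
# Unlabeled local methods whose defect proneness we want to predict.
local_methods = [[11, 2], [88, 17]]

filtered = nn_sample(cross_company, local_methods, k=3)
model = train_gaussian_nb(filtered)
predictions = [predict(model, m) for m in local_methods]
print(len(filtered), predictions)
```

The sampling step only uses the metric values of the local methods, not their defect labels, which is why it remains applicable when a company has no local defect data.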