Increasing diversity: Natural language measures for software fault prediction

Authors:
David Binkley;Henry Feild;Dawn Lawrie;Maurizio Pighin
Affiliations:
Loyola College, Baltimore, MD 21210, USA;University of Massachusetts, Amherst, MA 01003, USA;Loyola College, Baltimore, MD 21210, USA;Università degli Studi di Udine, Italy
Venue:
Journal of Systems and Software
Year:
2009

Abstract

Although challenging, predicting the faulty modules of a program is valuable to a software project because it can reduce the cost of software development, as well as of software maintenance and evolution. Three language-processing-based measures are introduced and applied to the problem of fault prediction. The first measure is based on the usage of natural language in a program's identifiers. The second measure concerns the conciseness and consistency of identifiers. The third measure, referred to as the QALP score, uses techniques from information retrieval to judge software quality. The QALP score has been shown to correlate with human judgments of software quality. Two case studies consider the language-processing measures' applicability to fault prediction using two programs (one open source, one proprietary). Linear mixed-effects regression models are used to identify relationships between defects and the measures. Results, while complex, show that language-processing measures improve fault prediction, especially when used in combination. Overall, the models explain one-third and two-thirds of the faults in the two case studies. Consistent with other uses of language processing, the value of the three measures increases with the size of the program module considered.