Towards more accurate retrieval of duplicate bug reports

Authors:
Chengnian Sun; David Lo; Siau-Cheng Khoo; Jing Jiang
Affiliations:
School of Computing, National University of Singapore, Singapore; School of Information Systems, Singapore Management University, Singapore; School of Computing, National University of Singapore, Singapore; School of Information Systems, Singapore Management University, Singapore
Venue:
ASE '11 Proceedings of the 2011 26th IEEE/ACM International Conference on Automated Software Engineering
Year:
2011

Abstract

In a bug tracking system, different testers or users may submit multiple reports on the same bug. Such reports, referred to as duplicates, can add maintenance effort to triaging and fixing bugs. To identify duplicates accurately, we propose a retrieval function (REP) that measures the similarity between two bug reports. REP fully utilizes the information available in a bug report: not only the similarity of the textual content in the summary and description fields, but also the similarity of non-textual fields such as product, component, and version. For more accurate measurement of textual similarity, we extend BM25F, an effective similarity formula from the information retrieval community, specifically for duplicate report retrieval. Lastly, we use a two-round stochastic gradient descent to automatically optimize REP for specific bug repositories in a supervised learning manner. We have validated our technique on three large software bug repositories from Mozilla, Eclipse, and OpenOffice. The experiments show a 10-27% relative improvement in recall rate@k and a 17-23% relative improvement in mean average precision over our previous model. We also applied our technique to a very large dataset of 209,058 reports from Eclipse, obtaining a recall rate@k of 37-71% and a mean average precision of 47%.
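
The two evaluation measures named in the abstract, recall rate@k and mean average precision, can be illustrated with a minimal sketch. It assumes that each duplicate report is used as a query against a ranked list of candidate reports and that the set of correct matches for each query is known. The function names, the dictionary layout, and the toy report IDs are hypothetical and not taken from the paper, whose exact metric definitions may differ in detail.

```python
from typing import Dict, List, Set


def recall_rate_at_k(rankings: Dict[str, List[str]],
                     matches: Dict[str, Set[str]],
                     k: int) -> float:
    """Fraction of queries with at least one correct match in the top k."""
    if not rankings:
        return 0.0
    hits = sum(
        1
        for query, ranked in rankings.items()
        if any(c in matches.get(query, set()) for c in ranked[:k])
    )
    return hits / len(rankings)


def mean_average_precision(rankings: Dict[str, List[str]],
                           matches: Dict[str, Set[str]]) -> float:
    """Mean over queries of the average precision of each ranked list."""
    if not rankings:
        return 0.0
    total = 0.0
    for query, ranked in rankings.items():
        relevant = matches.get(query, set())
        found = 0
        precision_sum = 0.0
        for position, candidate in enumerate(ranked, start=1):
            if candidate in relevant:
                found += 1
                # Precision at the rank where a correct match appears.
                precision_sum += found / position
        if relevant:
            total += precision_sum / len(relevant)
    return total / len(rankings)


# Toy data (hypothetical IDs): each query maps to its ranked candidates.
rankings = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"]}
matches = {"q1": {"b"}, "q2": {"z"}}

print(recall_rate_at_k(rankings, matches, k=2))   # 0.5
print(mean_average_precision(rankings, matches))  # (1/2 + 1/3) / 2
```

In this toy data each query has exactly one correct match, so average precision reduces to the reciprocal of the rank at which that match appears.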