Detecting Duplicate Bug Report Using Character N-Gram-Based Features

Authors:
Ashish Sureka;Pankaj Jalote
Venue:
APSEC '10 Proceedings of the 2010 Asia Pacific Software Engineering Conference
Year:
2010

Citing 0
Cited 4

Towards more accurate retrieval of duplicate bug reports
ASE '11 Proceedings of the 2011 26th IEEE/ACM International Conference on Automated Software Engineering

Identifying Linux bug fixing patches
Proceedings of the 34th International Conference on Software Engineering

Learning to rank duplicate bug reports
Proceedings of the 21st ACM international conference on Information and knowledge management

An Empirical Comparison of Machine Learning Techniques in Predicting the Bug Severity of Open and Closed Source Projects
International Journal of Open Source Software and Processes

Abstract

We present an approach to identify duplicate bug reports expressed in free-form text. Duplicate reports need to be identified so that the same bug is not assigned to multiple developers. Duplicate reports can also contain complementary information that is useful for bug fixing. Automatic identification of duplicate reports (from thousands of existing reports in a bug repository) can increase the productivity of a triager by reducing the time spent searching for duplicates of an incoming report. The proposed method uses a character N-gram-based model for duplicate bug report detection. Previous approaches are word-based, whereas this study investigates low-level character-based features, which have certain inherent advantages over word-based features for this problem: natural-language independence, robustness to noisy data, and effective handling of domain-specific term variations. The proposed solution is evaluated on a publicly available dataset of more than 200 thousand bug reports from the open-source Eclipse project. The dataset includes ground truth: bug reports pre-annotated as duplicates by the triager. Empirical results and evaluation metrics quantifying retrieval performance indicate that the approach is effective.
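
To make the idea of character N-gram features concrete, here is a minimal Python sketch of how an incoming report could be compared against existing reports. The abstract does not specify the paper's actual feature set or similarity function, so everything here is an assumption for illustration: the N-gram lengths (3 to 5), the use of cosine similarity over N-gram counts, and the function names `char_ngrams`, `cosine_similarity`, and `rank_candidates` are all hypothetical.

```python
from collections import Counter
import math


def char_ngrams(text, n_min=3, n_max=5):
    """Count character n-grams of lengths n_min..n_max.

    Assumption: text is lowercased and whitespace is collapsed first;
    the paper's own preprocessing is not described in the abstract.
    """
    normalized = " ".join(text.lower().split())
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(normalized) - n + 1):
            counts[normalized[i:i + n]] += 1
    return counts


def cosine_similarity(a, b):
    """Cosine similarity between two n-gram count vectors (Counters)."""
    if not a or not b:
        return 0.0
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)


def rank_candidates(incoming, repository, top_k=3):
    """Return the top_k (report_id, score) pairs most similar to `incoming`.

    `repository` maps hypothetical report ids to free-form report text.
    """
    query = char_ngrams(incoming)
    scored = [
        (cosine_similarity(query, char_ngrams(text)), report_id)
        for report_id, text in repository.items()
    ]
    scored.sort(reverse=True)
    return [(report_id, score) for score, report_id in scored[:top_k]]


# Illustrative data only; these are not reports from the Eclipse dataset.
repository = {
    "bug-1": "NullPointerException when opening the Java editor",
    "bug-2": "Toolbar icons are missing after update",
    "bug-3": "Cannot resize the console view",
}
incoming = "NPE (NullPointerExcepton) on opening java editors"
for report_id, score in rank_candidates(incoming, repository):
    print(report_id, round(score, 3))
```

In this toy example the misspelled "NullPointerExcepton" and the plural "editors" still share many character N-grams with the first report, which is the kind of robustness to noise and term variation the abstract attributes to character-level features. A word-based comparison would treat both tokens as unrelated to their correctly spelled counterparts.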