Learning to rank duplicate bug reports

Authors:
Jian Zhou;Hongyu Zhang
Affiliations:
Tsinghua University, Beijing, China;Tsinghua University, Beijing, China
Venue:
Proceedings of the 21st ACM international conference on Information and knowledge management
Year:
2012

Citing 26
Cited 0

Copy detection mechanisms for digital documents

SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Syntactic clustering of the Web

Selected papers from the sixth international conference on World Wide Web
Similarity estimation techniques from rounding algorithms

STOC '02 Proceedings of the thiry-fourth annual ACM symposium on Theory of computing
Machine Learning

Machine Learning
Methods for identifying versioned and plagiarized documents

Journal of the American Society for Information Science and Technology
Optimizing search engines using clickthrough data

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
An efficient boosting algorithm for combining preferences

The Journal of Machine Learning Research
Discriminative models for information retrieval

Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Simple BM25 extension to multiple weighted fields

Proceedings of the thirteenth ACM international conference on Information and knowledge management
Learning to rank using gradient descent

ICML '05 Proceedings of the 22nd international conference on Machine learning
Who should fix this bug?

Proceedings of the 28th international conference on Software engineering
Adapting ranking SVM to document retrieval

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Finding near-duplicate web pages: a large-scale evaluation of algorithms

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Optimisation methods for ranking functions with multiple parameters

CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
Detection of Duplicate Defect Reports Using Natural Language Processing

ICSE '07 Proceedings of the 29th international conference on Software Engineering
Learning to rank: from pairwise approach to listwise approach

Proceedings of the 24th international conference on Machine learning
FRank: a ranking method with fidelity loss

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
AdaRank: a boosting algorithm for information retrieval

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
An approach to detecting duplicate bug reports using natural language and execution information

Proceedings of the 30th international conference on Software engineering
Listwise approach to learning to rank: theory and algorithm

Proceedings of the 25th international conference on Machine learning
SpotSigs: robust and efficient near duplicate detection in large web collections

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Learning to Rank for Information Retrieval

Foundations and Trends in Information Retrieval
A discriminative model approach for accurate duplicate bug report retrieval

Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1
Detecting Duplicate Bug Report Using Character N-Gram-Based Features

APSEC '10 Proceedings of the 2010 Asia Pacific Software Engineering Conference
Probabilistic near-duplicate detection using simhash

Proceedings of the 20th ACM international conference on Information and knowledge management
Towards more accurate retrieval of duplicate bug reports

ASE '11 Proceedings of the 2011 26th IEEE/ACM International Conference on Automated Software Engineering

Quantified Score

Hi-index	0.00

Visualization

Abstract

For a large and complex software system, the project team could receive a large number of bug reports. Some bug reports could be duplicates as they essentially report the same problem. It is often tedious and costly to manually check if a newly reported bug is a duplicate of an already reported bug. In this paper, we propose BugSim, a method that can automatically retrieve duplicate bug reports given a new bug report. BugSim is based on learning to rank concepts. We identify textual and statistical features of bug reports and propose a similarity function for bug reports based on the features. We then construct a training set by assembling pairs of duplicate and non-duplicate bug reports. We train the weights of features by applying the stochastic gradient descent algorithm over the training set. For a new bug report, we retrieve candidate duplicate reports using the trained model. We evaluate BugSim using more than 45,100 real bug reports of twelve Eclipse projects. The evaluation results show that the proposed method is effective. On average, the recall rate for the top 10 retrieved reports is 76.11%. Furthermore, BugSim outperforms the previous state-of-art methods that are implemented using SVM and BM25Fext.