A large, complex software project can receive a large number of bug reports, some of which are duplicates because they report essentially the same problem. Manually checking whether a newly reported bug duplicates an existing report is tedious and costly. In this paper, we propose BugSim, a method that automatically retrieves duplicate bug reports for a new bug report. BugSim is based on learning-to-rank concepts. We identify textual and statistical features of bug reports and propose a similarity function for bug reports based on these features. We then construct a training set by assembling pairs of duplicate and non-duplicate bug reports, and learn the feature weights by applying stochastic gradient descent over this training set. For a new bug report, we retrieve candidate duplicate reports using the trained model. We evaluate BugSim on more than 45,100 real bug reports from twelve Eclipse projects. The results show that the method is effective: on average, the recall rate for the top 10 retrieved reports is 76.11%. Furthermore, BugSim outperforms previous state-of-the-art methods implemented using SVM and BM25Fext.
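
The pipeline described in the abstract (pairwise features, a weighted similarity function, weights learned by stochastic gradient descent, top-k retrieval, and recall rate) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the three features (token overlap of summaries, token overlap of descriptions, matching component), the report fields `summary`, `description`, `component`, and `id`, the pairwise logistic loss, and all function names and hyperparameters.

```python
import math
import random


def tokens(text):
    """Lowercased whitespace tokens as a set (a deliberately simple tokenizer)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard overlap of two token sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def features(r1, r2):
    """Hypothetical pairwise features; the paper's actual feature set is not shown here."""
    return [
        jaccard(tokens(r1["summary"]), tokens(r2["summary"])),
        jaccard(tokens(r1["description"]), tokens(r2["description"])),
        1.0 if r1["component"] == r2["component"] else 0.0,
    ]


def similarity(w, r1, r2):
    """Weighted linear combination of the pairwise features."""
    return sum(wi * fi for wi, fi in zip(w, features(r1, r2)))


def train(triples, n_features=3, lr=0.1, epochs=50, seed=0):
    """SGD on an assumed pairwise logistic loss, log(1 + exp(-margin)).

    Each triple is (query, duplicate, non_duplicate); the margin is
    similarity(query, duplicate) - similarity(query, non_duplicate).
    """
    rng = random.Random(seed)
    w = [0.0] * n_features
    triples = list(triples)
    for _ in range(epochs):
        rng.shuffle(triples)
        for query, dup, non_dup in triples:
            f_pos = features(query, dup)
            f_neg = features(query, non_dup)
            margin = sum(wi * (p - n) for wi, p, n in zip(w, f_pos, f_neg))
            # sigmoid(-margin); the clamp only guards against overflow in exp.
            g = 1.0 / (1.0 + math.exp(min(margin, 50.0)))
            # Gradient step: move weights toward ranking the duplicate higher.
            w = [wi + lr * g * (p - n) for wi, p, n in zip(w, f_pos, f_neg)]
    return w


def retrieve(w, query, repository, k=10):
    """Return the k reports in the repository most similar to the query."""
    ranked = sorted(repository, key=lambda r: similarity(w, query, r), reverse=True)
    return ranked[:k]


def recall_at_k(w, queries, repository, k=10):
    """Fraction of queries whose known duplicate id appears in the top k."""
    hits = sum(
        1
        for query, true_dup_id in queries
        if any(r["id"] == true_dup_id for r in retrieve(w, query, repository, k))
    )
    return hits / len(queries)
```

In this sketch, `recall_at_k(w, queries, repository, k=10)` corresponds to the "recall rate for the top 10 retrieved reports" metric: a query counts as a hit when its known duplicate appears among the ten highest-scoring candidates.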