Detecting similar software applications

Authors:
Collin McMillan;Mark Grechanik;Denys Poshyvanyk
Affiliations:
College of William and Mary, USA;Accenture Technology Labs, USA / University of Illinois at Chicago, USA;College of William and Mary, USA
Venue:
Proceedings of the 34th International Conference on Software Engineering
Year:
2012

Abstract

Although popular text search engines allow users to retrieve similar web pages, source code search engines do not have this feature. Detecting similar applications is a notoriously difficult problem, since it implies that similar high-level requirements and their low-level implementations can be detected and matched automatically for different applications. We created a novel approach for automatically detecting Closely reLated ApplicatioNs (CLAN) that helps users detect similar applications for a given Java application. Our main contributions are an extension to a framework of relevance and a novel algorithm that computes a similarity index between Java applications using the notion of semantic layers that correspond to packages and class hierarchies. We have built CLAN and we conducted an experiment with 33 participants to evaluate CLAN and compare it with the closest competitive approach, MUDABlue. The results show with strong statistical significance that CLAN automatically detects similar applications from a large repository of 8,310 Java applications with a higher precision than MUDABlue.
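
The abstract does not spell out how the similarity index is computed, so the following is only a minimal sketch of the general idea of comparing applications on two layers, one for classes and one for packages. It swaps in plain cosine similarity over usage counts, which is not claimed to be CLAN's algorithm. The function names, the equal default weighting, and the input format (a list of fully qualified API class names used by each application) are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def package_of(class_name):
    """Hypothetical helper: 'java.util.ArrayList' -> 'java.util'."""
    return class_name.rpartition(".")[0]


def layers(api_classes):
    """Build the two assumed layers: class-level and package-level counts."""
    class_layer = Counter(api_classes)
    package_layer = Counter(package_of(c) for c in api_classes)
    return class_layer, package_layer


def similarity(app_a, app_b, class_weight=0.5):
    """Weighted mix of per-layer cosine similarities (the weight is an assumption)."""
    classes_a, packages_a = layers(app_a)
    classes_b, packages_b = layers(app_b)
    return (class_weight * cosine(classes_a, classes_b)
            + (1 - class_weight) * cosine(packages_a, packages_b))


# Two toy "applications", each described by the API classes it uses.
app_a = ["java.util.ArrayList", "java.io.File"]
app_b = ["java.util.HashMap", "java.io.File"]
score = similarity(app_a, app_b)
```

In this toy input the two applications share only one class but use the same two packages, so the package layer raises the combined score above what class-level matching alone would give. That is the intuition behind comparing applications at more than one level of granularity.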