A study of the uniqueness of source code

Authors:
Mark Gabel;Zhendong Su
Affiliations:
University of California at Davis, Davis, CA, USA;University of California at Davis, Davis, CA, USA
Venue:
Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering
Year:
2010

Abstract

This paper presents the results of the first study of the uniqueness of source code. We define the uniqueness of a unit of source code with respect to the entire body of written software, which we approximate with a corpus of 420 million lines of source code. Our high-level methodology consists of examining a collection of 6,000 software projects and measuring the degree to which each project can be 'assembled' solely from portions of this corpus, thus providing a precise measure of 'uniqueness' that we call syntactic redundancy. We parameterized our study over several variables, the most important of which is the level of granularity at which we view source code. Our suite of experiments together consumed approximately four months of CPU time and provides quantitative answers to the following questions: at what levels of granularity is software unique, and at a given level of granularity, how unique is software? While we believe these questions to be of intrinsic interest, we also discuss possible applications to genetic programming and developer productivity tools.
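
The idea of measuring how much of a project can be 'assembled' from a corpus at a given granularity can be illustrated with a small sketch. The abstract does not give the exact definition, so everything below is an assumption made for illustration: granularity is taken to be a window of g consecutive tokens, the tokenizer is a crude regular expression, and the names `tokenize`, `windows`, and `syntactic_redundancy` are hypothetical. It should not be read as the authors' implementation.

```python
import re


def tokenize(source):
    # Crude lexer (assumption): identifiers, runs of digits, or any single
    # non-whitespace character count as one token.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)


def windows(tokens, g):
    # All contiguous token sequences of length g.
    return [tuple(tokens[i:i + g]) for i in range(len(tokens) - g + 1)]


def syntactic_redundancy(project_src, corpus_srcs, g):
    # Index every length-g token window that occurs anywhere in the corpus.
    corpus_windows = set()
    for src in corpus_srcs:
        corpus_windows.update(windows(tokenize(src), g))

    tokens = tokenize(project_src)
    if len(tokens) < g:
        return 0.0

    # A project token is "covered" if it lies inside at least one window
    # that also appears in the corpus.
    covered = [False] * len(tokens)
    for i, w in enumerate(windows(tokens, g)):
        if w in corpus_windows:
            for j in range(i, i + g):
                covered[j] = True

    # Fraction of the project's tokens that could be assembled from the corpus.
    return sum(covered) / len(tokens)


if __name__ == "__main__":
    corpus = ["y = a + b;"]
    for g in (2, 4, 6):
        print(g, syntactic_redundancy("x = a + b;", corpus, g))
```

In this toy setup the project differs from the corpus in a single identifier, so small windows cover most of its tokens while a window as long as the whole statement covers none. This mirrors the abstract's question of how uniqueness changes with the level of granularity.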