Software bertillonage: finding the provenance of an entity

  • Authors:
  • Julius Davies;Daniel M. German;Michael W. Godfrey;Abram Hindle

  • Affiliations:
  • University of Victoria, Canada;University of Victoria, Canada;University of Waterloo, Canada;University of California, Davis, USA

  • Venue:
  • Proceedings of the 8th Working Conference on Mining Software Repositories
  • Year:
  • 2011

Quantified Score

Hi-index 0.00

Visualization

Abstract

Deployed software systems are typically composed of many pieces, not all of which may have been created by the main development team. Often, the provenance of included components -- such as external libraries or cloned source code -- is not clearly stated, and this uncertainty can introduce technical and ethical concerns that make it difficult for system owners and other stakeholders to manage their software assets. In this work, we motivate the need for the recovery of the provenance of software entities by a broad set of techniques that could include signature matching, source code fact extraction, software clone detection, call flow graph matching, string matching, historical analyses, and other techniques. We liken our provenance goals to that of Bertillonage, a simple and approximate forensic analysis technique based on bio-metrics that was developed in 19th century France before the advent of fingerprints. As an example, we have developed a fast, simple, and approximate technique called anchored signature matching for identifying library version information within a given Java application. This technique involves a type of structured signature matching performed against a database of candidates drawn from the Maven2 repository, a 150GB collection of open source Java libraries. An exploratory case study using a proprietary e-commerce Java application illustrates that the approach is both feasible and effective.