A Lightweight Approach to Uncover Technical Artifacts in Unstructured Data

  • Authors:
  • Nicolas Bettenburg; Bram Adams; Ahmed E. Hassan; Michel Smidt

  • Venue:
  • ICPC '11 Proceedings of the 2011 IEEE 19th International Conference on Program Comprehension
  • Year:
  • 2011

Abstract

Developer communication through email, chat, or issue report comments consists largely of unstructured data, i.e., natural language text, mixed with technical artifacts such as project-specific jargon, abbreviations, source code patches, stack traces, and identifiers. These technical artifacts represent a valuable source of knowledge on the technical part of the system, with a wide range of applications from establishing traceability links to creating project-specific vocabularies. However, the lack of well-defined boundaries between natural language and technical content makes the automated mining of technical artifacts challenging. As a first step towards a general-purpose technique for extracting technical artifacts from unstructured data, we present a lightweight approach to untangle technical artifacts and natural language text. Our approach is based on existing spell checking tools, which are well understood, fast, readily available across platforms, and impartial to different kinds of textual data. Through a handcrafted benchmark, we demonstrate that our approach is able to successfully uncover a wide range of technical artifacts in unstructured data.
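
The abstract does not spell out how the spell checker is applied, so the following is only a minimal sketch of the general idea, under stated assumptions: each line of a message is split into tokens, tokens that a dictionary does not recognize are counted, and lines with a high share of unrecognized tokens are flagged as likely technical content. The names `KNOWN_WORDS`, `misspelled_ratio`, `classify_lines`, and the 0.5 threshold are hypothetical and not taken from the paper; the tiny word set stands in for a real spell checking tool.

```python
# Hypothetical stand-in for a spell checker's dictionary. A real setup would
# query an existing spell checking tool instead of this small fixed set.
KNOWN_WORDS = {
    "the", "a", "an", "i", "is", "it", "in", "on", "to", "of", "and", "at",
    "when", "this", "see", "get", "following", "error", "after", "open",
    "file", "please", "below", "trace",
}


def misspelled_ratio(line, known_words=KNOWN_WORDS):
    """Return the fraction of whitespace-separated tokens not in known_words."""
    tokens = line.split()
    if not tokens:
        return 0.0
    unknown = 0
    for token in tokens:
        # Drop surrounding punctuation and ignore case before the lookup.
        word = token.strip(".,;:!?\"'()").lower()
        if word not in known_words:
            unknown += 1
    return unknown / len(tokens)


def classify_lines(text, threshold=0.5):
    """Label each non-empty line 'technical' or 'natural'.

    Assumption: a line counts as technical when at least `threshold` of its
    tokens are unknown to the dictionary. The threshold is illustrative only.
    """
    result = []
    for line in text.splitlines():
        if not line.strip():
            continue
        label = "technical" if misspelled_ratio(line) >= threshold else "natural"
        result.append((line, label))
    return result


if __name__ == "__main__":
    message = (
        "Please see the trace below.\n"
        "at org.example.Parser.parse(Parser.java:42)\n"
    )
    for line, label in classify_lines(message):
        print(f"{label:9s} | {line}")
```

With a full dictionary in place of the toy word set, the same per-line ratio could be computed per token window or per paragraph; which granularity works best is an open choice in this sketch.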