Content classification of development emails

Authors:
Alberto Bacchelli;Tommaso Dal Sasso;Marco D'Ambros;Michele Lanza
Affiliations:
University of Lugano, Switzerland;University of Lugano, Switzerland;University of Lugano, Switzerland;University of Lugano, Switzerland
Venue:
Proceedings of the 34th International Conference on Software Engineering
Year:
2012

Abstract

Emails related to the development of a software system contain information about design choices and issues encountered during the development process. Exploiting the knowledge embedded in emails with automatic tools is challenging because this communication medium is unstructured, noisy, and mixed-language. Natural language text is often not well-formed and is interleaved with content in other syntaxes, such as code or stack traces. We present an approach to classify email content at the line level. Our technique classifies email lines into five categories (i.e., text, junk, code, patch, and stack trace), so that ad hoc analysis techniques can subsequently be applied to each category. We evaluated our approach on a statistically significant set of emails gathered from mailing lists of four unrelated open source systems.
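To make the idea of line-level classification concrete, the following is a minimal sketch in Python. It is not the authors' technique, which the abstract does not detail; it is a hand-written heuristic baseline that assigns each line to one of the five categories named above. All regular expressions, the function names `classify_line` and `classify_email`, and the decision to treat blank lines, signature delimiters, and separator rules as junk are assumptions made purely for illustration.

```python
import re

# Hypothetical heuristics for illustration only; the abstract does not
# specify which features the actual approach relies on.
PATCH_RE = re.compile(r"^(diff --git |Index: |--- |\+\+\+ |@@ .* @@)")
STACK_RE = re.compile(
    r"^\s*(at\s+[\w.$<>]+\(.*\)"                    # Java-style stack frame
    r"|Caused by: |Exception in thread "
    r"|(?:[\w$]+\.)+[\w$]*(?:Exception|Error)\b)"   # qualified exception name
)
JUNK_RE = re.compile(r"^\s*(--\s*$|[-_=*]{5,}\s*$)")  # signature mark, rules
CODE_RE = re.compile(
    r"(;\s*$|[{}]\s*$"
    r"|^\s*(public|private|protected|static|class|import|return|if|for|while)\b.*[;{(])"
)


def classify_line(line):
    """Label one line, ignoring its context. Order of checks matters."""
    if not line.strip():
        return "junk"  # assumption: blank lines carry no content
    if PATCH_RE.match(line):
        return "patch"
    if STACK_RE.match(line):
        return "stack_trace"
    if JUNK_RE.match(line):
        return "junk"
    if CODE_RE.search(line):
        return "code"
    return "text"


def classify_email(body):
    """Label every line of an email body, returning (label, line) pairs.

    A small amount of state is kept so that lines following a unified-diff
    hunk header and starting with '+', '-' or ' ' are labelled as patch.
    """
    labelled = []
    in_hunk = False
    for line in body.splitlines():
        if in_hunk and line[:1] in ("+", "-", " "):
            label = "patch"
        else:
            label = classify_line(line)
            in_hunk = label == "patch" and line.startswith("@@")
        labelled.append((label, line))
    return labelled
```

Calling `classify_email` on a message body yields one label per line, which a downstream tool could use to route text lines to a natural language analysis and code or stack trace lines to a parser. Rules like these are brittle (for instance, a prose sentence ending in a semicolon is labelled as code), which illustrates why the noisy, mixed-language nature of emails makes the task hard.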