Capturing programming content in online discussions

  • Authors:
  • Mahdy Khayyamian;Jihie Kim

  • Affiliations:
  • USC Information Sciences Institute, Marina del Rey, USA;USC Information Sciences Institute, Marina del Rey, USA

  • Venue:
  • Proceedings of the seventh international conference on Knowledge capture
  • Year:
  • 2013

Abstract

In this paper, we introduce a new problem: automatically capturing programming content in online discussions. We expect that solving this problem will help improve the visual presentation of programming forum content, the qualitative analysis of forum contributions, and forum text preprocessing and normalization. We map this problem to a sequence learning problem and use Conditional Random Fields (CRF) to solve it. We compare its performance with a word-feature-based baseline and a non-sequence classification method (Naïve Bayes). The CRF method produces the best results, with an F1 score of 86.9%. Moreover, we demonstrate that the CRF classifier maintains good accuracy across domains: a model learned from a C++ forum performs almost as well on forums for other programming languages, namely Java and Python. As a demonstration of how the captured information can be used, we provide an example of user profiling with programming content. In particular, we correlate the percentage of programming content in student answers with the student's course performance.
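
The user-profiling step mentioned at the end of the abstract can be illustrated with a minimal sketch. It assumes that a classifier has already tagged each unit of a student's answers as either programming content or plain text. The abstract does not state the labeling unit, the label names, or the correlation measure, so the "CODE"/"TEXT" labels, the function names, the choice of Pearson correlation, and the toy inputs below are all hypothetical and are not taken from the paper.

```python
from math import sqrt


def code_fraction(labels):
    """Return the share of units labeled "CODE" (0.0 for an empty list).

    The "CODE"/"TEXT" label names are an assumption made for this sketch.
    """
    if not labels:
        return 0.0
    return sum(1 for label in labels if label == "CODE") / len(labels)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Toy, made-up inputs for illustration only (not data from the paper):
# predicted labels for each student's answers, and a course score.
labels_by_student = {
    "s1": ["TEXT", "CODE", "CODE", "TEXT"],
    "s2": ["TEXT", "TEXT", "TEXT", "CODE"],
    "s3": ["CODE", "CODE", "CODE", "TEXT"],
}
score_by_student = {"s1": 80.0, "s2": 70.0, "s3": 90.0}

students = sorted(labels_by_student)
fractions = [code_fraction(labels_by_student[s]) for s in students]
scores = [score_by_student[s] for s in students]
r = pearson(fractions, scores)
```

In this sketch `fractions` holds one programming-content ratio per student and `r` is its correlation with the course scores. With real forum data, the label lists would come from the trained classifier's predictions rather than being written by hand.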