Stack Overflow is widely regarded as the most popular community-driven question answering (CQA) website for programmers. Questions posted on Stack Overflow that do not meet community guidelines are marked as `closed' by experienced users and community moderators. A question can be `closed' for one of five reasons -- duplicate, off-topic, subjective, not a real question, and too localized. In this work, we present the first study of `closed' questions on Stack Overflow. We download 4 years of publicly available data containing 3.4 million questions. We first analyze and characterize the complete set of 0.1 million `closed' questions. Next, we use a machine learning framework to build a predictive model that identifies a `closed' question at the time of question creation. One of our key findings is that, despite being marked as `closed', subjective questions contain high information value and are very popular with users. We observe an increasing trend in the percentage of closed questions over time and find that this increase is positively correlated with the number of newly registered users. In addition, we see a decrease in community participation in marking `closed' questions, which has led to an increase in moderation time. We also find that questions closed with the Duplicate and Off-Topic labels are relatively more prone to reputation gaming. Our analysis suggests broader implications for content quality maintenance on CQA websites. For the `closed' question prediction task, we make use of multiple genres of feature sets based on user profile, community process, textual style, and question content. We use a state-of-the-art machine learning classifier based on an ensemble framework and achieve an overall accuracy of 70.3%. Analysis of the feature space reveals that `closed' questions are relatively less informative and descriptive than non-`closed' questions.
To the best of our knowledge, this is the first experimental study to analyze and predict `closed' questions on Stack Overflow.
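To make the prediction setup concrete, the following is a minimal, illustrative sketch of such a pipeline: a few hand-crafted textual-style features (the feature names and thresholds here are invented for illustration, not the paper's actual feature set, which also spans user-profile and community-process genres) feeding a toy majority-vote ensemble of one-feature decision stumps standing in for the ensemble classifier.

```python
def question_features(title: str, body: str) -> dict:
    """Extract a few textual-style features of the kind such a
    framework might use (illustrative names, not the paper's set)."""
    words = body.split()
    return {
        "title_len": len(title.split()),
        "body_len": len(words),
        "has_code_block": int("<code>" in body),
        "question_marks": body.count("?"),
    }

def stump(feature: str, threshold: float):
    """A one-feature decision stump: votes `closed' (1) when the
    feature value falls below the threshold."""
    return lambda feats: int(feats[feature] < threshold)

# Toy majority-vote ensemble; thresholds are made up for illustration.
ENSEMBLE = [
    stump("body_len", 20),        # very short bodies look underspecified
    stump("has_code_block", 1),   # no code snippet often means vague/off-topic
    stump("title_len", 4),        # terse titles correlate with low quality
]

def predict_closed(title: str, body: str) -> bool:
    """Predict at creation time whether a question is likely `closed'."""
    feats = question_features(title, body)
    votes = sum(clf(feats) for clf in ENSEMBLE)
    return votes >= 2  # simple majority vote

if __name__ == "__main__":
    print(predict_closed("help", "my code no work please fix"))  # True
    print(predict_closed(
        "Why does list.sort() return None in Python?",
        "Calling <code>x = mylist.sort()</code> leaves x as None. "
        "I expected the sorted list back. " * 3,
    ))  # False
```

In practice, a real implementation would replace the hand-set stumps with a learned ensemble (e.g., boosted trees) trained on labeled `closed'/non-`closed' questions, and would add the user-profile and community-process features described above.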