A Semi-supervised Topic-Driven Approach for Clustering Textual Answers to Survey Questions

  • Authors:
  • Hui Yang;Ajay Mysore;Sharonda Wallace

  • Affiliations:
  • Department of Computer Science, San Francisco State University, USA 94132;Department of Computer Science, San Francisco State University, USA 94132;Human Nutrition & Food Science, California State Polytechnic University, USA 91768

  • Venue:
  • ADMA '09 Proceedings of the 5th International Conference on Advanced Data Mining and Applications
  • Year:
  • 2009

Abstract

We propose an algorithm to effectively cluster a specific type of text document: textual responses gathered through a survey system. Because such responses exhibit peculiar features (e.g., short in length, rich in outliers, and diverse in categories), traditional unsupervised and semi-supervised clustering techniques struggle to achieve the performance a survey task demands. We address this issue by proposing a semi-supervised, topic-driven approach. It first employs an unsupervised algorithm to generate a preliminary clustering schema for all the answers to a question. A human expert then uses this schema to identify the major topics in these answers. Finally, a topic-driven clustering algorithm is applied to obtain an improved clustering schema. We evaluated this approach using five questions in a survey we recently conducted in the U.S. The results demonstrate that this approach can lead to significant improvement in clustering quality.
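The three-step pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the algorithms used, so everything here is an assumption: the bag-of-words cosine similarity, the greedy leader clustering standing in for the unsupervised step, the keyword lists standing in for the expert's topics, the thresholds, and the sample answers are all hypothetical and are not the authors' method or data.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts (lowercased, whitespace-split)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def preliminary_clusters(answers, threshold=0.3):
    """Step 1 (hypothetical stand-in): greedy leader clustering.
    An answer joins the most similar existing cluster leader if the
    similarity reaches the threshold; otherwise it starts a new cluster."""
    clusters = []  # list of (leader_vector, member_indices)
    for i, text in enumerate(answers):
        vec = vectorize(text)
        best = max(clusters, key=lambda c: cosine(vec, c[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= threshold:
            best[1].append(i)
        else:
            clusters.append((vec, [i]))
    return [members for _, members in clusters]

def topic_driven_clusters(answers, topics, threshold=0.1):
    """Step 3 (hypothetical stand-in): assign each answer to the most
    similar expert topic, or to 'outlier' if no topic is similar enough."""
    topic_vecs = {name: vectorize(" ".join(words)) for name, words in topics.items()}
    result = {name: [] for name in topics}
    result["outlier"] = []
    for i, text in enumerate(answers):
        vec = vectorize(text)
        name, score = max(((n, cosine(vec, tv)) for n, tv in topic_vecs.items()),
                          key=lambda pair: pair[1])
        result[name if score >= threshold else "outlier"].append(i)
    return result

# Made-up survey answers, for illustration only.
answers = [
    "too expensive to buy fresh fruit",
    "fresh vegetables are expensive",
    "no time to cook",
    "i do not have time to cook meals",
    "asdf",
]

# Step 1: preliminary schema shown to the expert.
prelim = preliminary_clusters(answers)

# Step 2: topics the expert might derive from that schema (hypothetical).
topics = {"cost": ["expensive", "price", "cost"], "time": ["time", "busy", "cook"]}

# Step 3: topic-driven re-clustering.
final = topic_driven_clusters(answers, topics)
print(prelim)
print(final)
```

In this sketch the expert's input enters only as keyword lists, and answers matching no topic are set aside as outliers, which mirrors the abstract's observation that survey responses are rich in outliers. A real system would likely need stemming, stop-word handling, and a more robust similarity measure for such short texts.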