A Semi-supervised Topic-Driven Approach for Clustering Textual Answers to Survey Questions

Authors:
Hui Yang;Ajay Mysore;Sharonda Wallace
Affiliations:
Department of Computer Science, San Francisco State University, USA 94132;Department of Computer Science, San Francisco State University, USA 94132;Human Nutrition & Food Science, California State Polytechnic University, USA 91768
Venue:
ADMA '09 Proceedings of the 5th International Conference on Advanced Data Mining and Applications
Year:
2009

Citing 15
Cited 0

Automatic text processing: the transformation, analysis, and retrieval of information by computer

Automatic text processing: the transformation, analysis, and retrieval of information by computer
Combining labeled and unlabeled data with co-training

COLT' 98 Proceedings of the eleventh annual conference on Computational learning theory
Concept decompositions for large sparse text data using clustering

Machine Learning
Constrained K-means Clustering with Background Knowledge

ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
Semi-supervised Clustering by Seeding

ICML '02 Proceedings of the Nineteenth International Conference on Machine Learning
Fast Algorithms for Mining Association Rules in Large Databases

VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases
Frequent term-based text clustering

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Cluster ensembles --- a knowledge reuse framework for combining multiple partitions

The Journal of Machine Learning Research
CBC: Clustering Based Text Classification Requiring Minimal Labeled Data

ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining
Generative model-based clustering of directional data

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Subspace clustering for high dimensional data: a review

ACM SIGKDD Explorations Newsletter - Special issue on learning from imbalanced datasets
A probabilistic framework for semi-supervised clustering

Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Text Mining: Predictive Methods for Analyzing Unstructured Information

Text Mining: Predictive Methods for Analyzing Unstructured Information
Semi-supervised model-based document clustering: A comparative study

Machine Learning
Ontology Based Clustering for Improving Genomic IR

CBMS '07 Proceedings of the Twentieth IEEE International Symposium on Computer-Based Medical Systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

We propose an algorithm to effectively cluster a specific type of text documents: textual responses gathered through a survey system. Due to the peculiar features exhibited in such responses (e.g., short in length, rich in outliers, and diverse in categories), traditional unsupervised and semi-supervised clustering* techniques are challenged to achieve satisfactory performance as demanded by a survey task. We address this issue by proposing a semi-supervised, topic-driven approach. It first employs an unsupervised algorithm to generate a preliminary clustering schema for all the answers to a question. A human expert then uses this schema to identify the major topics in these answers. Finally, a topic-driven clustering algorithm is adopted to obtain an improved clustering schema. We evaluated this approach using five questions in a survey we recently conducted in the U.S. The results demonstrate that this approach can lead to significant improvement in clustering quality.