Automating survey coding by multiclass text categorization techniques

Authors:
Daniela Giorgetti;Fabrizio Sebastiani
Affiliations:
Istituto di Linguistica Computazionale, Consiglio Nazionale delle Ricerche, 56124 Pisa, Italy;Istituto di Scienza e Tecnologie dell'Informazione, Consiglio Nazionale delle Ricerche, 56124 Pisa, Italy
Venue:
Journal of the American Society for Information Science and Technology
Year:
2003

Citing 16
Cited 0

Optimizing convenient online access to bibliographic databases

Information Services and Use
Term-weighting approaches in automatic text retrieval

Information Processing and Management: an International Journal
Document retrieval systems

Document retrieval systems
Readings in information retrieval

Readings in information retrieval
Inductive learning algorithms and representations for text categorization

Proceedings of the seventh international conference on Information and knowledge management
Automatic essay grading using text categorization techniques

Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
A re-examination of text categorization methods

Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Feature selection in SVM text categorization

AAAI '99/IAAI '99 Proceedings of the sixteenth national conference on Artificial intelligence and the eleventh Innovative applications of artificial intelligence conference innovative applications of artificial intelligence
Machine learning in automated text categorization

ACM Computing Surveys (CSUR)
A Simple Decomposition Method for Support Vector Machines

Machine Learning
Text classification using ESC-based stochastic decision lists

Information Processing and Management: an International Journal
Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval

ECML '98 Proceedings of the 10th European Conference on Machine Learning
Text Categorization with Suport Vector Machines: Learning with Many Relevant Features

ECML '98 Proceedings of the 10th European Conference on Machine Learning
A Performance Evaluation of Automatic Survey Classifiers

ICGI '98 Proceedings of the 4th International Colloquium on Grammatical Inference
On the algorithmic implementation of multiclass kernel-based vector machines

The Journal of Machine Learning Research
Automated scoring using a hybrid feature identification technique

COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 1

Quantified Score

Hi-index	0.00

Visualization

Abstract

Survey coding is the task of assigning a symbolic code from a predefined set of such codes to the answer given in response to an open-ended question in a questionnaire (aka survey). This task is usually carried out to group respondents according to a predefined scheme based on their answers. Survey coding has several applications, especially in the social sciences, ranging from the simple classification of respondents to the extraction of statistics on political opinions, health and lifestyle habits, customer satisfaction, brand fidelity, and patient satisfaction. Survey coding is a difficult task, because the code that should be attributed to a respondent based on the answer she has given is a matter of subjective judgment, and thus requires expertise. It is thus unsurprising that this task has traditionally been performed manually, by trained coders. Some attempts have been made at automating this task, most of them based on detecting the similarity between the answer and textual descriptions of the meanings of the candidate codes. We take a radically new stand, and formulate the problem of automated survey coding as a text categorization problem, that is, as the problem of learning, by means of supervised machine learning techniques, a model of the association between answers and codes from a training set of precoded answers, and applying the resulting model to the classification of new answers. In this article we experiment with two different learning techniques: one based on naive Bayesian classification, and the other one based on multiclass support vector machines, and test the resulting framework on a corpus of social surveys. The results we have obtained significantly outperform the results achieved by previous automated survey coding approaches.