An evaluation of classification models for question topic categorization

Authors:
Bo Qu; Gao Cong; Cuiping Li; Aixin Sun; Hong Chen
Affiliations:
Information School, Renmin University of China, China 100872; Nanyang Technological University, Blk N4, 50 Nanyang Avenue, Singapore 639798; Information School, Renmin University of China, China 100872; Nanyang Technological University, Blk N4, 50 Nanyang Avenue, Singapore 639798; Information School, Renmin University of China, China 100872
Venue:
Journal of the American Society for Information Science and Technology
Year:
2012


Abstract

We study the problem of question topic classification using a very large real-world Community Question Answering (CQA) dataset from Yahoo! Answers. The dataset comprises 3.9 million questions, organized into more than 1,000 categories in a hierarchy. To the best of our knowledge, this is the first systematic evaluation of different classification methods both on question topic classification and on short texts. Specifically, we empirically evaluate the following in classifying questions into CQA categories: (a) the usefulness of n-gram features and bag-of-words features; (b) the performance of three standard classification algorithms (naive Bayes, maximum entropy, and support vector machines); (c) the performance of state-of-the-art hierarchical classification algorithms; (d) the effect of training data size on performance; and (e) the effectiveness of the different components of CQA data, including subject, content, asker, and the best answer. The experimental results show which aspects are important for question topic classification in terms of both effectiveness and efficiency. We believe that the experimental findings from this study will be useful in real-world classification problems. © 2012 Wiley Periodicals, Inc.
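
To make items (a) and (b) of the abstract concrete, the following is a minimal sketch of how bag-of-words and word n-gram features could feed a naive Bayes question classifier. It is an illustration under stated assumptions, not the authors' implementation: the names `extract_features` and `NaiveBayes`, the whitespace tokenizer, the add-one smoothing, and the four toy questions are all hypothetical and are not taken from the paper or from the Yahoo! Answers data.

```python
import math
from collections import Counter, defaultdict


def extract_features(text, n_max=1):
    """Return word n-grams of length 1..n_max.

    n_max=1 gives plain bag-of-words features; n_max=2 adds bigrams.
    Tokenization by lowercasing and whitespace splitting is an assumption.
    """
    tokens = text.lower().split()
    feats = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(" ".join(tokens[i:i + n]))
    return feats


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing (hypothetical helper)."""

    def __init__(self, n_max=1):
        self.n_max = n_max
        self.class_counts = Counter()               # questions seen per category
        self.feature_counts = defaultdict(Counter)  # feature counts per category
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            for feat in extract_features(text, self.n_max):
                self.feature_counts[label][feat] += 1
                self.vocab.add(feat)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, float("-inf")
        for label, count in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each known feature.
            score = math.log(count / total_docs)
            denom = sum(self.feature_counts[label].values()) + vocab_size
            for feat in extract_features(text, self.n_max):
                if feat in self.vocab:  # features unseen in training are skipped
                    score += math.log((self.feature_counts[label][feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training questions, invented for illustration only.
TRAIN = [
    ("how do i fix my laptop screen", "Computers"),
    ("best antivirus software for windows", "Computers"),
    ("how to train a puppy to sit", "Pets"),
    ("what food is safe for my cat", "Pets"),
]
texts, labels = zip(*TRAIN)
unigram_model = NaiveBayes(n_max=1).fit(texts, labels)
bigram_model = NaiveBayes(n_max=2).fit(texts, labels)
```

Calling `unigram_model.predict("my cat will not eat her food")` on this toy data returns `"Pets"`. Switching `n_max` from 1 to 2 is the only change needed to compare the two feature sets; on data at the scale described in the abstract, a sparse-matrix library would be a more realistic choice than these in-memory counters.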