Text categorization with knowledge transfer from heterogeneous data sources

  • Authors: Rakesh Gupta; Lev Ratinov
  • Affiliations: Honda Research Institute USA Inc., Mountain View, CA; Department of Computer Science, University of Illinois, Urbana, IL
  • Venue: AAAI'08 Proceedings of the 23rd national conference on Artificial intelligence - Volume 2
  • Year: 2008

Abstract

Multi-category classification of short dialogues is a common task performed by humans. When assigning a question to an expert, a customer service operator tries to classify the customer query into one of N different classes for which experts are available. Similarly, questions on the web (for example, questions at Yahoo Answers) can be automatically forwarded to a restricted group of people with specific expertise. Typical questions are short and assume background world knowledge for correct classification. With the exponentially increasing amount of knowledge available, and with its distinct properties (labeled vs. unlabeled, structured vs. unstructured), no single knowledge-transfer algorithm such as transfer learning, multi-task learning, or self-taught learning can be applied universally. In this work we show that bag-of-words classifiers perform poorly on noisy, short conversational text snippets. We present an algorithm for leveraging heterogeneous data sources and algorithms, with significant improvements over any single algorithm, rivaling human performance. Using a different algorithm for each knowledge source, we apply mutual information to aggressively prune features. With heterogeneous data sources including Wikipedia, the Open Directory Project (ODP), and Yahoo Answers, we show 89.4% and 96.8% correct classification on the Google Answers corpus and the Switchboard corpus, respectively, using only 200 features per class. This is a large improvement over bag-of-words approaches and a 48-65% error reduction over the previously published state of the art (Gabrilovich et al. 2006).
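
The abstract contrasts a bag-of-words baseline with mutual-information-based feature pruning. The snippet below is a minimal sketch of that kind of pipeline, not the authors' implementation: it builds bag-of-words features, keeps the top-k features by mutual information with the class label, and trains a simple baseline classifier. The toy texts, labels, and the small k value are hypothetical; the paper selects 200 features per class from heterogeneous knowledge sources, which this sketch does not reproduce.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; the paper evaluates on Google Answers and Switchboard.
texts = [
    "how do I fix my car engine",
    "best recipe for apple pie",
    "my laptop will not boot",
    "how long to bake a cake",
]
labels = ["auto", "cooking", "computers", "cooking"]

pipeline = make_pipeline(
    CountVectorizer(stop_words="english"),       # bag-of-words features
    SelectKBest(mutual_info_classif, k=10),      # prune features by mutual information
    MultinomialNB(),                             # simple baseline classifier
)
pipeline.fit(texts, labels)
print(pipeline.predict(["why does my engine overheat"]))

In a realistic setting the feature pool would come from multiple sources (e.g. Wikipedia- or ODP-derived features alongside raw words), with mutual information used to keep only the most class-informative features from each source.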