Extending WHIRL with background knowledge for improved text classification

  • Authors: Sarah Zelikovitz; William W. Cohen; Haym Hirsh

  • Affiliations: Computer Science Department, College of Staten Island of CUNY, Staten Island, USA 10314; Center for Automated Learning and Discovery, Carnegie Mellon University, Pittsburgh, USA 15213; Department of Computer Science, Rutgers University, Piscataway, USA 08854-8019

  • Venue: Information Retrieval

  • Year: 2007


Abstract

Intelligent use of the many diverse forms of data available on the Internet requires new tools for managing and manipulating heterogeneous forms of information. This paper uses WHIRL, an extension of relational databases that can manipulate textual data using statistical similarity measures developed by the information retrieval community. We show that although WHIRL is designed for more general similarity-based reasoning tasks, it is competitive with mature systems designed explicitly for inductive classification. In particular, WHIRL is well suited for combining different sources of knowledge in the classification process. We show on a diverse set of tasks that the use of appropriate sets of unlabeled background knowledge often decreases error rates, particularly if the number of examples or the size of the strings in the training set is small. This is especially useful when labeling text is a labor-intensive job and when there is a large amount of information available about a particular problem on the World Wide Web.
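The classification approach the abstract describes rests on statistical text-similarity measures from information retrieval. As a rough illustration of that underlying idea (not the authors' actual WHIRL system, which expresses this as similarity joins over relational data and further incorporates unlabeled background knowledge), here is a minimal sketch of TF-IDF cosine-similarity nearest-neighbor classification; the corpus and labels are hypothetical:

```python
# Hedged sketch: classify a text by its TF-IDF cosine similarity to labeled
# training texts. This illustrates the general similarity-based reasoning the
# abstract refers to; it is not the WHIRL implementation itself.
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return a sparse TF-IDF vector (dict term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()                       # document frequency of each term
    for toks in tokenized:
        df.update(set(toks))
    n = len(docs)
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(toks).items()}
            for toks in tokenized]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def classify(train_texts, train_labels, query):
    """Predict the label of the most similar training text (1-NN)."""
    vecs = tfidf_vectors(train_texts + [query])  # shared vocabulary/IDF
    query_vec = vecs[-1]
    sims = [cosine(query_vec, v) for v in vecs[:-1]]
    return train_labels[sims.index(max(sims))]

# Hypothetical usage: labeled short texts, one unlabeled query.
train = ["red wine from france",
         "white wine tasting notes",
         "craft beer hops brewery"]
labels = ["wine", "wine", "beer"]
print(classify(train, labels, "french red wine"))  # -> wine
```

The abstract's point is that this kind of similarity reasoning degrades when training texts are few or very short; the paper's contribution is to mitigate that by letting unlabeled background texts mediate the comparison between a test document and the labeled examples.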