Asymmetric Missing-data Problems: Overcoming the Lack of Negative Data in Preference Ranking

  • Authors:
  • Aleksander Kołcz;Joshua Alspector

  • Affiliations:
  • Personalogy, Inc., 24 South Weber Suite 325, Colorado Springs, CO 80903, USA. ark@eas.uccs.edu;Department of Electrical and Computer Engineering, University of Colorado at Colorado Springs, 1420 Austin Bluffs Pkwy., Colorado Springs, CO 80918, USA. josh@eas.uccs.edu

  • Venue:
  • Information Retrieval
  • Year:
  • 2002

Quantified Score

Hi-index 0.00

Visualization

Abstract

In certain classification problems there is a strong a asymmetry between the number of labeled examples available for each of the classes involved. In an extreme case, there may be a complete lack of labeled data for one of the classes while, at the same time, there are adequate labeled examples for the others, accompanied by a large body of unlabeled data. Since most classification algorithms require some information about all classes involved, label estimation for the un-represented class is desired. An important representative of this group of problems is that of user interest/preference modeling where there may be a large number of examples of what the user likes with essentially no counterexamples.Recently, there has been much interest in applying the EM algorithm to incomplete data problems in the area of text retrieval and categorization. We adapt this approach to the asymmetric case of modeling user interests in news articles, where only labeled positive training data are available, with access to a large corpus of unlabeled documents. User modeling is here equivalent to that of user-specific document ranking. EM is used in conjunction with the Naive Bayes model while its output is also utilized by a Support Vector Machine and Rocchio's technique.Our findings demonstrate that the EM algorithm can be quite effective in modeling the negative class under a number of different initialization schemes. Although primarily just the negative training examples are needed, a natural question is whether using all of the estimated labels (i.e., positive and negative) would be more (or less) beneficial. This is important considering that, in this context, the initialization of the negative class for EM is likely not to be very accurate. Experimental results suggest that EM output should be limited to negative label estimates only.