Asymmetric Missing-data Problems: Overcoming the Lack of Negative Data in Preference Ranking

Authors:
Aleksander Kołcz;Joshua Alspector
Affiliations:
Personalogy, Inc., 24 South Weber Suite 325, Colorado Springs, CO 80903, USA. ark@eas.uccs.edu;Department of Electrical and Computer Engineering, University of Colorado at Colorado Springs, 1420 Austin Bluffs Pkwy., Colorado Springs, CO 80918, USA. josh@eas.uccs.edu
Venue:
Information Retrieval
Year:
2002

Abstract

In certain classification problems there is a strong asymmetry between the number of labeled examples available for each of the classes involved. In an extreme case, there may be a complete lack of labeled data for one of the classes while, at the same time, there are adequate labeled examples for the others, accompanied by a large body of unlabeled data. Since most classification algorithms require some information about all classes involved, label estimation for the unrepresented class is desired. An important representative of this group of problems is user interest/preference modeling, where there may be a large number of examples of what the user likes with essentially no counterexamples.

Recently, there has been much interest in applying the EM algorithm to incomplete-data problems in the area of text retrieval and categorization. We adapt this approach to the asymmetric case of modeling user interests in news articles, where only labeled positive training data are available, together with access to a large corpus of unlabeled documents. User modeling here is equivalent to user-specific document ranking. EM is used in conjunction with the Naive Bayes model, and its output is also utilized by a Support Vector Machine and Rocchio's technique.

Our findings demonstrate that the EM algorithm can be quite effective in modeling the negative class under a number of different initialization schemes. Although primarily just the negative training examples are needed, a natural question is whether using all of the estimated labels (i.e., positive and negative) would be more (or less) beneficial. This is important considering that, in this context, the initialization of the negative class for EM is unlikely to be very accurate. Experimental results suggest that EM output should be limited to negative label estimates only.
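The general idea of running EM with a Naive Bayes model on positive-only labels can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`fit_nb`, `posterior_pos`, `em_negative_estimates`), the multinomial event model with Laplace smoothing, the "treat every unlabeled document as negative" initialization, the 0.5 threshold, and the toy word-count data are all assumptions made for illustration. Labeled positives keep their labels throughout, and only the estimated negatives are returned, in line with the abstract's conclusion.

```python
import numpy as np

def fit_nb(X, w_pos, alpha=1.0):
    """M-step: multinomial Naive Bayes parameters from soft class weights."""
    w = np.column_stack([1.0 - w_pos, w_pos])        # (n_docs, 2): [neg, pos]
    # Smoothed class priors, so neither class can reach probability zero.
    log_prior = np.log((w.sum(axis=0) + 1.0) / (w.sum() + 2.0))
    counts = w.T @ X + alpha                         # (2, vocab), Laplace-smoothed
    log_word = np.log(counts / counts.sum(axis=1, keepdims=True))
    return log_prior, log_word

def posterior_pos(X, log_prior, log_word):
    """E-step: P(positive | document) under the current parameters."""
    joint = X @ log_word.T + log_prior               # (n_docs, 2) log joint
    joint -= joint.max(axis=1, keepdims=True)        # stabilise before exp
    p = np.exp(joint)
    return p[:, 1] / p.sum(axis=1)

def em_negative_estimates(X_pos, X_unl, n_iter=20, threshold=0.5):
    """Return indices of unlabeled documents estimated to be negative."""
    X = np.vstack([X_pos, X_unl])
    n_pos = X_pos.shape[0]
    # Assumed initialization: labeled docs positive, all unlabeled docs negative.
    w_pos = np.concatenate([np.ones(n_pos), np.zeros(X_unl.shape[0])])
    for _ in range(n_iter):
        log_prior, log_word = fit_nb(X, w_pos)
        # Only unlabeled documents are re-estimated; labeled positives stay fixed.
        w_pos[n_pos:] = posterior_pos(X_unl, log_prior, log_word)
    p_unl = w_pos[n_pos:]
    return np.flatnonzero(p_unl < threshold), p_unl

# Hypothetical toy data: rows are documents, columns are counts of 4 words.
X_pos = np.array([[5, 4, 0, 0], [4, 5, 1, 0], [6, 3, 0, 1]], dtype=float)
X_unl = np.array([[5, 5, 0, 0], [0, 1, 5, 4], [1, 0, 4, 6],
                  [4, 6, 0, 0], [0, 0, 5, 5]], dtype=float)

neg_idx, p_unl = em_negative_estimates(X_pos, X_unl)
print("estimated negatives:", neg_idx.tolist())
```

In a pipeline of the kind the abstract describes, the documents selected by `neg_idx` would serve as the negative training set for a downstream ranker, while the unlabeled documents that EM scores as positive would be left out of training.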