Categorization of display ads using image and landing page features

Authors:
Andrew Kae;Kin Kan;Vijay K. Narayanan;Dragomir Yankov
Affiliations:
University of Massachusetts, Amherst MA;Yahoo! Labs, Santa Clara CA;Yahoo! Labs, Santa Clara CA;Yahoo! Labs, Santa Clara CA
Venue:
Proceedings of the Third Workshop on Large Scale Data Mining: Theory and Applications
Year:
2011



Abstract

We consider the problem of automatically categorizing display ad images into a taxonomy of relevant interest categories. In particular, we focus on how well image features extracted from the ad images by OCR techniques help predict the category of a display ad when used in addition to features from the text in the title, keywords, and body of the ad's landing page and features of the advertiser. An automated ad categorization tool has multiple uses in display advertising: increasing ad categorization coverage, scaling up categorization capacity to handle large volumes of ads by reducing the amount of human editorial effort, and freeing human editorial experts to focus on categorizing difficult ads. The ad image and landing page features extracted in this ad categorization system can also be used to improve the matching and ranking steps of ad selection algorithms in display ad serving systems. We learn multiple one-versus-rest SVM models to categorize the display ads from a historical dataset of ads labeled into these categories by human editors. The OCR features extracted by common open source tools are noisy by themselves, and models trained using only the OCR features are not competitive with models trained using the landing page features. However, for categories with a small number of training examples, the OCR features improve the categorization performance metrics when used in addition to the features from the landing page. The OCR features also provide a useful signal for predicting the category of an ad when features from the landing page are not available. Our models have an average precision of 0.6 and recall of 0.37 over more than 1200 categories when evaluated on a held-out dataset. The precision and recall values are considerably higher for categories with larger amounts of training data, with precision above 0.84 and recall above 0.7 in all categories that have more than 100,000 samples in the training dataset. Features from the text in the body of the landing page increase the recall of the categorization models and, to a lesser extent, their precision, especially in categories with a smaller number of training samples.
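
The per-category precision and recall figures quoted above, and their average over categories, can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. It assumes that the average is an unweighted (macro) mean over categories, that each ad may carry several category labels, and that a category with no predictions scores a precision of 0. The function names and the toy category names are hypothetical.

```python
from collections import defaultdict


def per_category_precision_recall(true_labels, predicted_labels):
    """Return {category: (precision, recall)} for multi-label predictions.

    Both arguments are parallel lists holding one set of category names per ad.
    """
    tp = defaultdict(int)  # predicted and correct
    fp = defaultdict(int)  # predicted but not in the editorial labels
    fn = defaultdict(int)  # in the editorial labels but not predicted
    for truth, pred in zip(true_labels, predicted_labels):
        for category in pred & truth:
            tp[category] += 1
        for category in pred - truth:
            fp[category] += 1
        for category in truth - pred:
            fn[category] += 1

    scores = {}
    for category in set(tp) | set(fp) | set(fn):
        predicted = tp[category] + fp[category]
        actual = tp[category] + fn[category]
        # Assumption: an undefined ratio (zero denominator) is scored as 0.0.
        precision = tp[category] / predicted if predicted else 0.0
        recall = tp[category] / actual if actual else 0.0
        scores[category] = (precision, recall)
    return scores


def macro_average(scores):
    """Unweighted mean of precision and recall over all categories."""
    n = len(scores)
    mean_precision = sum(p for p, _ in scores.values()) / n
    mean_recall = sum(r for _, r in scores.values()) / n
    return mean_precision, mean_recall


# Toy data: four ads with hypothetical category names.
truth = [{"autos"}, {"autos"}, {"travel"}, {"travel", "finance"}]
predicted = [{"autos"}, {"travel"}, {"travel"}, {"travel"}]

scores = per_category_precision_recall(truth, predicted)
macro_precision, macro_recall = macro_average(scores)
```

In this toy data the "finance" category is never predicted, so it contributes zeros to both averages. Under a macro average, many sparsely trained categories can pull the overall figures well below those of the large categories in the same way.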