Learning website hierarchies for keyword enrichment in contextual advertising

Authors:
Pavan Kumar GM;Krishna P. Leela;Mehul Parsana;Sachin Garg
Affiliations:
Yahoo! Labs, Bangalore, India;Microsoft adCenter, Bangalore, India;Microsoft adCenter, Bangalore, India;Yahoo! Labs, Bangalore, India
Venue:
Proceedings of the fourth ACM international conference on Web search and data mining
Year:
2011

Abstract

In contextual advertising, textual ads relevant to the content of a webpage are embedded in the page. Content keywords are extracted offline by crawling webpages and are then stored in an index for fast serving. Given a page, ad selection involves an index lookup, computing the similarity between the keywords of the page and those of candidate ads, and returning the top-k scoring ads. In this approach, ad relevance can suffer in two scenarios. First, since page-ad similarity is computed using keywords extracted only from that particular page, a few non-pertinent keywords can skew ad selection. Second, the requesting page may not be present in the index, yet relevant ads must still be served. We propose a novel mechanism to mitigate both problems within the same framework. The basic idea is to enrich the keywords of a particular page with keywords from other, "similar" pages. The scheme involves learning a website-specific hierarchy from the (page, URL) pairs of the website. Next, keywords are populated on the nodes via successive top-down and bottom-up iterations over the hierarchy. We evaluate our approach on three data sets: one small human-labeled set and two large-scale sets from Yahoo's contextual advertising system. Empirical evaluation shows that ads fetched by enriching keywords have 2-3% higher nDCG than ads fetched using a recent semantic approach, even though the index size of our approach is 7 times smaller than that of the semantic approach. Evaluation over pages that are not present in the index shows that ads fetched by our method have 6-7% higher nDCG than ads fetched using a recent approach that uses the first N bytes of the page content. Scalability is demonstrated via a map-reduce adaptation of our method and training on a large data set of 220 million pages from 95,104 websites.
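
The enrichment idea can be illustrated with a minimal Python sketch. It is not the method from the paper: the abstract does not say how the hierarchy is learned or how keywords are weighted, so the sketch simply assumes that the hierarchy is the tree of URL path segments, that a bottom-up pass averages children's keywords into their parent, and that a top-down pass adds a fraction alpha of the parent's keywords to each child. The names Node, build_hierarchy, bottom_up, top_down, lookup and the parameter alpha are all hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


class Node:
    """One node of a hypothetical URL-path hierarchy (an assumption of this sketch)."""

    def __init__(self):
        self.children = {}         # path segment -> child Node
        self.own = Counter()       # keywords extracted from the page at this node
        self.keywords = Counter()  # enriched keywords after propagation


def path_segments(url):
    """Split the path of a URL into its non-empty segments."""
    return [s for s in urlparse(url).path.split("/") if s]


def build_hierarchy(pages):
    """Build the tree from a dict mapping URL -> {keyword: weight}."""
    root = Node()
    for url, kws in pages.items():
        node = root
        for seg in path_segments(url):
            node = node.children.setdefault(seg, Node())
        node.own.update(kws)
    return root


def bottom_up(node):
    """Set node.keywords to its own keywords plus the mean of its children's."""
    agg = Counter(node.own)
    for child in node.children.values():
        bottom_up(child)
        for kw, w in child.keywords.items():
            agg[kw] += w / len(node.children)
    node.keywords = agg


def top_down(node, inherited=None, alpha=0.5):
    """Add a fraction alpha of the parent's keywords to every descendant."""
    if inherited:
        for kw, w in inherited.items():
            node.keywords[kw] += alpha * w
    passed = Counter(node.keywords)  # copy, so children do not mutate it
    for child in node.children.values():
        top_down(child, passed, alpha)


def lookup(root, url):
    """Return keywords of the deepest known node on the URL's path.

    For a page that was never crawled, this falls back to an ancestor node.
    """
    node = root
    for seg in path_segments(url):
        if seg not in node.children:
            break
        node = node.children[seg]
    return node.keywords


if __name__ == "__main__":
    pages = {
        "http://example.com/sports/cricket/match1": {"cricket": 1.0, "score": 1.0},
        "http://example.com/sports/cricket/match2": {"cricket": 1.0, "mortgage": 1.0},
        "http://example.com/sports/tennis/final": {"tennis": 1.0},
    }
    root = build_hierarchy(pages)
    bottom_up(root)
    top_down(root)
    print(lookup(root, "http://example.com/sports/cricket/match2").most_common(3))
    print(lookup(root, "http://example.com/sports/cricket/match3").most_common(3))
```

In this toy example, the page match2 carries a non-pertinent keyword ("mortgage") that initially ties with "cricket"; after propagation, keywords shared with sibling pages outweigh it. The URL match3 is not in the index, so lookup falls back to the keywords accumulated at the cricket node, which corresponds to the second scenario described in the abstract.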